ADA2017 project, milestone 2

In [1]:
# Python version check: Spark 1.6.3 is documented for Python up to 3.4, but this environment runs a newer Anaconda Python
!python --version
Python 3.6.1 :: Anaconda custom (64-bit)
In [2]:
# Use the user name to determine Spark and dataset directories
import getpass
USER = getpass.getuser()
if USER == 'Tarmo':
    SPARK_DIR = 'C:/Users/Tarmo/Documents/Lausanne/CS-401_applied_data_analysis/spark/spark-1.6.3-bin-hadoop2.6'
    DATASET_DIR = 'C:/Users/Tarmo/Documents/Lausanne/data/Books_5.json'
    METADATA_DIR = 'C:/Users/Tarmo/Documents/Lausanne/data/meta_Books.json'
elif USER == 'adam':
    SPARK_DIR = '/home/adam/EPFL_courses/spark-1.6.3-bin-hadoop2.6'
    DATASET_DIR = '/media/adam/B236CB1D36CAE209/Studia/ADA/reviews_Books_5.json'
    METADATA_DIR = '/media/adam/B236CB1D36CAE209/Studia/ADA/meta_Books.json'
else:
    SPARK_DIR = '/home/adam/EPFL_courses/spark-1.6.3-bin-hadoop2.6'
In [3]:
SPARK_DIR
Out[3]:
'/home/adam/EPFL_courses/spark-1.6.3-bin-hadoop2.6'

Import libraries

Spark libraries

In [4]:
import findspark
findspark.init(SPARK_DIR)

import pyspark
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SQLContext
from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime  # to_timestamp is a pyspark 2.2+ function

Other libraries

In [2]:
import pandas as pd
from scipy import stats
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.utils import resample
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import json
import re
import gzip
from collections import Counter
import operator

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

Collection and aggregation of data

Loading and processing the data to see what it contains. To make further processing faster, we create intermediate well-structured csv files to avoid having to process the JSON data files every time we want to look at the data. We take a look at several different metrics and their aggregates in the data (review length, review number; by day, by month, by year, by book, etc.)
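The "compute once, then reload from CSV" pattern described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name (`cached_aggregate`) and a made-up file name, not part of the notebook itself:

```python
import os
import pandas as pd

def cached_aggregate(path, compute):
    """Load an aggregate from CSV if present; otherwise compute and cache it."""
    if os.path.exists(path):
        return pd.read_csv(path, index_col=0)
    df = compute()  # stands in for an expensive Spark aggregation
    df.to_csv(path)
    return df

# Trivial "computation" standing in for a Spark job
result = cached_aggregate(
    "tmp_demo_agg.csv",
    lambda: pd.DataFrame({"Avg_length": [1752.0]},
                         index=pd.Index(["1996-05-20"], name="Date")))
```

Subsequent runs of the notebook then skip the Spark job entirely and read the cached CSV, which is exactly why the intermediate files below are saved.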

Initialization of Spark and SQL context

In [6]:
sc = SparkContext()
sqlContext = SQLContext(sc)

sc.version
Out[6]:
'1.6.3'

Loading of dataset and metadata

In [8]:
# Load the dataset and create RDDs
text_file = sc.textFile(DATASET_DIR)
# Convert previously read text file to json DataFrame
df = sqlContext.read.json(text_file)

# Load metadata for dataset and convert it to DataFrame
metadata = sc.textFile(METADATA_DIR)
metadata_df = sqlContext.read.json(metadata)

# Register DataFrames as tables to use those names in SQL-type queries
sqlContext.registerDataFrameAsTable(metadata_df, "metadata")
sqlContext.registerDataFrameAsTable(df, "dataset")
In [109]:
# Number of partitions in the RDD
text_file.getNumPartitions()
Out[109]:
282
In [110]:
# Number of records in the dataset
text_file.count()
Out[110]:
8898041
In [9]:
# Schema of the dataset
df.printSchema()
root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)

Initial data processing

In [10]:
# A few casts: reviewTime string->date, unixReviewTime int->timestamp
# We do this to be able to better filter and manipulate the data
df = df.withColumn('unixReviewTime', from_unixtime(df['unixReviewTime']))
df = df.withColumn('reviewTime', to_date(df['unixReviewTime']))
df = df.withColumn('unixReviewTime', df['unixReviewTime'].cast('timestamp'))
df.printSchema()
root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: date (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: timestamp (nullable = true)

In [113]:
# Look at a couple of records, just to be sure we obtained what we wanted
df.select("reviewTime", 'reviewText', 'unixReviewTime').take(2)
Out[113]:
[Row(reviewTime=datetime.date(2012, 12, 16), reviewText='Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!', unixReviewTime=datetime.datetime(2012, 12, 16, 1, 0)),
 Row(reviewTime=datetime.date(2003, 12, 11), reviewText="This is one my must have books. It is a masterpiece of spirituality. I'll be the first to admit, its literary quality isn't much. It is rather simplistically written, but the message behind it is so powerful that you have to read it. It will take you to enlightenment.", unixReviewTime=datetime.datetime(2003, 12, 11, 1, 0))]

Average length of reviews per day

In [ ]:
# Compute the average review length per day
aTuple = (0, 0)
avg = df.select("reviewTime", 'reviewText').rdd.map(lambda row: (row.reviewTime, len(row.reviewText)))
avg = avg.aggregateByKey(aTuple, lambda a,b: (a[0] + b, a[1] + 1), lambda a,b: (a[0] + b[0], a[1] + b[1]))
avg = avg.mapValues(lambda v: v[0]/v[1])
avg = avg.collect()
In [ ]:
# Processing acquired data using Pandas
avg_len = pd.DataFrame(avg, columns=['Date', 'Avg_length'])
avg_len['Date'] = pd.to_datetime(avg_len['Date'])
avg_len.set_index('Date', inplace=True)
avg_len.sort_index(inplace=True)
avg_len.head()
In [ ]:
# Save to file so we don't have to recompute this
avg_len.to_csv("avg_length_review_by_day.csv")
In [6]:
avg_len = pd.read_csv("avg_length_review_by_day.csv", index_col='Date', parse_dates=['Date'])
avg_len.head()
Out[6]:
Avg_length
Date
1996-05-20 1752.0
1996-06-25 1123.0
1996-08-17 826.0
1996-09-18 897.0
1996-11-15 1757.0

Average length of reviews per month

In [115]:
monthly_data = avg_len.groupby(avg_len.index.to_period('M')).mean()
monthly_data.plot(figsize=(20,10))
Out[115]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe29587fba8>

We notice that the average review length per month appears to have three somewhat stable levels over time: before 2000, between 2000 and 2013, and after 2013. In the middle period the review length is consistently high, then crashes around 2012-2013. At the beginning of the dataset period the number of reviews is small, so the monthly average shows high variation.

Average length of reviews between 2012 and 2013

In [116]:
monthly_data['2012':'2013'].plot(figsize=(20,10))
Out[116]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe295a040b8>

The extreme drop is located between August 2012 and January 2013.

In [ ]:
# Save processed data
monthly_data.to_csv("avg_length_review_by_month.csv")
In [ ]:
# (Successful) attempt to filter by timestamp 
tmp = df.rdd.filter(lambda row: row.unixReviewTime > pd.to_datetime('2012-05')
                     and row.unixReviewTime < pd.to_datetime('2013'))
tmp.take(5)

Number of reviews per day

In [ ]:
# Number of reviews per day
number_of_reviews = df.rdd.map(lambda row: (row.reviewTime, 1)).reduceByKey(lambda a, b: a+b).collect()
In [ ]:
# Processing the data
rev_num = pd.DataFrame(number_of_reviews, columns=['Date', 'Number of reviews'])
rev_num['Date'] = pd.to_datetime(rev_num['Date'])
rev_num.set_index('Date', inplace=True)
rev_num.sort_index(inplace=True)
rev_num.head()
In [ ]:
# Save the data so we don't have to recompute it
rev_num.to_csv("number_of_reviews_per_day.csv")
In [9]:
# Read file with the data
rev_num = pd.read_csv("number_of_reviews_per_day.csv", index_col='Date', parse_dates=['Date'])
rev_num.tail()
Out[9]:
Number of reviews
Date
2014-07-19 2797
2014-07-20 3332
2014-07-21 3456
2014-07-22 2697
2014-07-23 619

Number of reviews per month

In [ ]:
monthly_data_reviews = rev_num.groupby(rev_num.index.to_period('M')).sum()
# Save the processed data
monthly_data_reviews.to_csv("number_of_reviews_per_month.csv")
In [118]:
# Read file with the data
monthly_data_reviews = pd.read_csv("number_of_reviews_per_month.csv", index_col='Date', parse_dates=['Date'])
In [119]:
monthly_data_reviews.plot(figsize=(20,10))
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe295772978>

We see that in the beginning of the data period, there are hardly any reviews being written. The number of reviews increases consistently until there is a huge jump after 2012. The extreme increase coincides with the drop in the average review length, and may reflect a sudden change in reviewer numbers and/or behavior. The apparent increase might also be due to changes in Amazon review policies: after a significant change, Amazon might retroactively remove old reviews not conforming to the new policy.

Number of reviews per book

In [ ]:
number_of_reviews_per_book = df.rdd.map(lambda row: (row.asin, 1)).reduceByKey(lambda a, b: a+b).collect()
len(number_of_reviews_per_book)
In [ ]:
# Processing the data using Pandas and saving it to csv file
df_number_of_reviews_per_book = pd.DataFrame(number_of_reviews_per_book, columns=['Book_id', 'Number of reviews'])
df_number_of_reviews_per_book.sort_values('Number of reviews', ascending=False, inplace = True)
df_number_of_reviews_per_book.to_csv("number_of_reviews_per_book.csv", index=False)
df_number_of_reviews_per_book.head(10)
In [120]:
# Read file with the data
df_number_of_reviews_per_book = pd.read_csv("number_of_reviews_per_book.csv", index_col='Book_id')
In [129]:
# Boxplot of numbers of reviews
sns.boxplot(np.array(df_number_of_reviews_per_book['Number of reviews'].values), orient='v')
Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe28c0afd68>

The boxplot is not very box-like, since most books have only a small number of reviews.

We look at the numbers without the outliers:

In [130]:
sns.boxplot(np.array(df_number_of_reviews_per_book['Number of reviews'].values), orient='v', showfliers=False)
Out[130]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe28c0deb00>
In [131]:
# Basic statistics for this variable
stats.describe(df_number_of_reviews_per_book['Number of reviews'].values)
Out[131]:
DescribeResult(nobs=367982, minmax=(5, 7440), mean=24.180641987923323, variance=4396.0938613274502, skewness=25.747403776095645, kurtosis=1461.9790549248125)

Top 30 books based on the number of reviews

Our time-independent popularity metric is the number of reviews, so we take a closer look at the top 30 books according to it.
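The top-30 selection by this metric can equivalently be done in plain pandas on the per-book counts. A minimal sketch on made-up book ids (the real notebook does this in Spark below):

```python
import pandas as pd

# Toy per-book review counts
counts = pd.DataFrame({
    "Book_id": ["b1", "b2", "b3", "b4"],
    "Number of reviews": [7440, 5, 120, 980],
})

# Sort descending by review count and keep the top entries
top = counts.sort_values("Number of reviews", ascending=False).head(2)
```

Both approaches give the same ranking; Spark is preferred here only because the raw data is too large to count comfortably in memory.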

In [ ]:
top_books = df.rdd.map(lambda row: (row.asin, 1)).reduceByKey(lambda a, b: a+b).sortBy(lambda wc: -wc[1]).take(30)

top_books_df = sqlContext.createDataFrame(top_books, ['asin', 'rew_num'])
sqlContext.registerDataFrameAsTable(top_books_df, "top_books")
top_books_df.take(5)
In [ ]:
# Join the dataset table with the metadata table to see titles of the most popular books
sqlContext.sql("select t.asin, m.title from metadata m join top_books t on m.asin=t.asin").collect()
In [ ]:
sqlContext.sql("select t.asin, m.title from metadata m join top_books t on m.asin=t.asin limit 10").collect()

Two methods of filtering by book ids

In [ ]:
most_reviewed_books_id_top = df_number_of_reviews_per_book[:30]
most_reviewed_top = (df.rdd.filter(lambda row: row.asin in list(most_reviewed_books_id_top.Book_id))
                           .map(lambda row: (row.asin, row.reviewTime)).collect())
In [ ]:
most_reviewed_top_2 = sqlContext.sql("select asin, reviewTime from dataset where asin in " + 
               str(tuple(most_reviewed_books_id_top.Book_id)))

Processing the data for top30 books

In [ ]:
# Processing the data using Pandas
most_reviewed_books_top30_df = pd.DataFrame(most_reviewed_top, columns=['asin', 'reviewTime'])
# Convert to datetime type
most_reviewed_books_top30_df['reviewTime'] = pd.to_datetime(most_reviewed_books_top30_df['reviewTime'])
# Assign a count of 1 per review so summing gives the number of reviews
most_reviewed_books_top30_df['Number_of_reviews'] = 1
# Create a monthly period column for aggregation
most_reviewed_books_top30_df['Year-month'] = most_reviewed_books_top30_df['reviewTime'].dt.to_period('M')
most_reviewed_books_top30_df.head()
In [ ]:
# Save data describing number of reviews per day for each book in top30
most_reviewed_books_top30_df.groupby(['asin', 'reviewTime']).sum() \
                            .to_csv("number_of_reviews_per_day_top30_books.csv")
In [ ]:
# Aggregating the data by month
m_rev_books_by_month = most_reviewed_books_top30_df.groupby(['asin', 'Year-month'], as_index=True).sum()
m_rev_books_by_month.to_csv("number_of_reviews_per_month_top30_books.csv")
In [2]:
m_rev_books_by_month = pd.read_csv("number_of_reviews_per_month_top30_books.csv").set_index(['asin', 'Year-month'])

Multi-line plot for top30 books - timeseries of reviews per month

In [3]:
number_of_reviews_per_month_top30_books_UNSTACKED_df = m_rev_books_by_month.unstack(level=0)
In [ ]:
number_of_reviews_per_month_top30_books_UNSTACKED_df.to_csv("number_of_reviews_per_month_top30_books_UNSTACKED.csv")
In [4]:
number_of_reviews_per_month_top30_books_UNSTACKED_df.plot(figsize = (20,10))
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9fa7c337f0>

We see that for the top 30 books, most reviews are made after January 2013.

In [135]:
number_of_reviews_per_month_top30_books_UNSTACKED_df['2013':].plot(figsize = (20,10))
Out[135]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe283d76ba8>

A top-30 book accrues many reviews right after publication (or rather, after its first review appears); how quickly this reviewing rate decays varies strongly from book to book.

Cumulated number of reviews for top 30 books

In [9]:
number_of_reviews_per_month_top30_books_cumsum_df = number_of_reviews_per_month_top30_books_UNSTACKED_df.cumsum()
In [11]:
number_of_reviews_per_month_top30_books_cumsum_df.shape
Out[11]:
(192, 30)
In [10]:
number_of_reviews_per_month_top30_books_cumsum_df.plot(figsize = (20,10))
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9fa4717ef0>
# A way to apply a method to each entry of a column:
# fun = np.vectorize(lambda x: x.to_timestamp())
# m_rev_books_by_month['Timestamp'] = fun(m_rev_books_by_month['Year-month'].values)
# m_rev_books_by_month.set_index('asin')

Average length and number of reviews per book

In [ ]:
aTuple = (0, 0)
avg_len_review = df.select('asin', 'reviewText').rdd.map(lambda row: (row.asin, len(row.reviewText)))
avg_len_review = avg_len_review.aggregateByKey(aTuple, lambda a,b: (a[0] + b, a[1] + 1), lambda a,b: (a[0] + b[0], a[1] + b[1]))
avg_len_review = avg_len_review.mapValues(lambda v: (v[0]/v[1], v[1]))
avg_len_review = avg_len_review.collect()
In [ ]:
# Flatten the tuples so the data can be loaded as a DataFrame
avg_len_review = [(k, v1, v2) for k, (v1, v2) in avg_len_review]
In [ ]:
# Processing and saving to file
avg_len_review_per_book_df = pd.DataFrame(avg_len_review, columns=['Book_id', 'Avg_len', 'number_of_reviews'])
avg_len_review_per_book_df.sort_values(['Avg_len', 'number_of_reviews'], ascending=False, inplace=True)
avg_len_review_per_book_df.to_csv("avg_length_and_number_of_reviews_per_book.csv", index=False)
avg_len_review_per_book_df.head(10)
In [247]:
avg_len_review_per_book_df = pd.read_csv("avg_length_and_number_of_reviews_per_book.csv",index_col="Book_id")
avg_len_review_per_book_df.head()
Out[247]:
Avg_len number_of_reviews
Book_id
0195396170 13459.5 6
0700619542 9934.8 5
0061312118 9511.1 10
0875522645 9377.8 5
1405159286 8716.0 9
In [8]:
avg_len_review_per_book_df.plot(x = 'Avg_len', y = 'number_of_reviews', kind = 'scatter', figsize = (20,10), 
                       alpha = 0.3, title = 'Number of reviews to average length for each book')
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc61135a4a8>
In [10]:
avg_len_review_per_book_df[avg_len_review_per_book_df['number_of_reviews'] > 10].plot(x = 'Avg_len', y = 'number_of_reviews', kind = 'scatter', figsize = (20,10), 
                       alpha = 0.3, title = 'Number of reviews to average length for books with 10 and more reviews')
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc611201390>

Collect reviews rating information

Each record contains: book serial number, review date, reviewer ID, review length, rating, and helpfulness votes.

In [14]:
ratings_with_length = df.select("asin", "unixReviewTime", "reviewerID", 'reviewText', 'overall', 'helpful').rdd.map(lambda row: (row.asin, row.unixReviewTime, row.reviewerID, len(row.reviewText), row.overall, row.helpful)).collect()
In [18]:
ratings_with_length_df = pd.DataFrame(ratings_with_length, columns = ["asin", "unixReviewTime", "reviewerID", 'review_length', 'overall', 'helpful'])
ratings_with_length_df['unixReviewTime'] = pd.to_datetime(ratings_with_length_df['unixReviewTime'])
ratings_with_length_df['overall']  = pd.to_numeric(ratings_with_length_df['overall'], downcast='unsigned')
ratings_with_length_df['review_length']  = pd.to_numeric(ratings_with_length_df['review_length'])
ratings_with_length_df.dtypes
In [ ]:
ratings_with_length_df['helpful_yes'] = ratings_with_length_df['helpful'].apply(lambda x: x[0])
ratings_with_length_df['helpful_yes']  = pd.to_numeric(ratings_with_length_df['helpful_yes'])
ratings_with_length_df['helpful_no'] = ratings_with_length_df['helpful'].apply(lambda x: x[1])
ratings_with_length_df['helpful_no']  = pd.to_numeric(ratings_with_length_df['helpful_no'])
ratings_with_length_df.drop('helpful', inplace = True, axis = 1)
In [40]:
ratings_with_length_df.to_csv("ratings_with_length.csv", index=False)
In [35]:
ratings_with_length_df.describe()
Out[35]:
review_length overall helpful_yes helpful_no
count 8.898041e+06 8.898041e+06 8.898041e+06 8.898041e+06
mean 8.189253e+02 4.249932e+00 3.007799e+00 4.050255e+00
std 1.035836e+03 1.057733e+00 1.967758e+01 2.263825e+01
min 0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00
25% 1.920000e+02 4.000000e+00 0.000000e+00 0.000000e+00
50% 4.290000e+02 5.000000e+00 0.000000e+00 1.000000e+00
75% 1.051000e+03 5.000000e+00 2.000000e+00 3.000000e+00
max 3.265800e+04 5.000000e+00 2.331100e+04 2.421200e+04
In [49]:
fig, ax = plt.subplots()
fig.set_figwidth(20)
sns.boxplot(y = 'overall', x = 'review_length', data = ratings_with_length_df, orient='h',ax = ax)
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a40c1ee48>
In [50]:
fig, ax = plt.subplots()
fig.set_figwidth(20)
sns.boxplot(y = 'overall', x = 'review_length', data = ratings_with_length_df, orient='h', ax = ax, showfliers=False)
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a38bd0390>

Ratings

Collecting the data about ratings

Each row contains the book serial number (asin), review date, and rating (overall). We construct an intermediate file where each book is associated with five columns giving the number of 1- to 5-star ratings, respectively.

Rating is in stars (integers), from 1 to 5.
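The asin-by-star-count table built below via groupby/unstack can also be produced directly with `pd.crosstab`. A minimal sketch on toy data (ids and ratings are made up):

```python
import pandas as pd

ratings = pd.DataFrame({
    "asin":   ["a", "a", "a", "b", "b"],
    "rating": [5, 5, 1, 3, 4],
})

# Count occurrences of each rating per book
table = pd.crosstab(ratings["asin"], ratings["rating"])
# Star levels absent from the data get no column; reindex to guarantee all five
table = table.reindex(columns=[1, 2, 3, 4, 5], fill_value=0)
```

The groupby/unstack route used in the notebook gives the same result and also generalizes to other aggregations, which is why it is used below.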

In [ ]:
ratings = df.select("asin", "reviewTime", "overall").rdd.map(lambda row: (row.asin, row.reviewTime, int(row.overall))).collect()
In [ ]:
# Create a DataFrame from the dataset collected before
ratings_df = pd.DataFrame(ratings, columns=['asin', 'Date', 'rating'])
ratings_df['Date'] = pd.to_datetime(ratings_df['Date'])
ratings_df.shape
In [144]:
ratings_df
Out[144]:
asin Date rating Year-month
0 000100039X 2012-12-16 5 2012-12
1 000100039X 2003-12-11 5 2003-12
2 000100039X 2014-01-18 5 2014-01
3 000100039X 2011-09-27 5 2011-09
4 000100039X 2002-10-07 5 2002-10
5 000100039X 2014-01-27 5 2014-01
6 000100039X 2008-03-28 5 2008-03
7 000100039X 2013-11-03 5 2013-11
8 000100039X 2014-01-29 5 2014-01
9 000100039X 2013-09-22 5 2013-09
10 000100039X 2009-01-15 5 2009-01
11 000100039X 2013-11-20 5 2013-11
12 000100039X 2014-03-16 5 2014-03
13 000100039X 2001-02-28 5 2001-02
14 000100039X 2013-01-23 5 2013-01
15 000100039X 2012-06-27 5 2012-06
16 000100039X 2012-07-16 5 2012-07
17 000100039X 2005-11-16 5 2005-11
18 000100039X 2012-02-15 5 2012-02
19 000100039X 2003-10-13 5 2003-10
20 000100039X 2013-08-26 5 2013-08
21 000100039X 2001-02-24 5 2001-02
22 000100039X 2000-07-13 3 2000-07
23 000100039X 2008-03-21 2 2008-03
24 000100039X 2013-06-17 5 2013-06
25 000100039X 2012-03-12 5 2012-03
26 000100039X 2004-10-03 5 2004-10
27 000100039X 2014-01-15 5 2014-01
28 000100039X 2013-09-03 5 2013-09
29 000100039X 2014-01-26 5 2014-01
... ... ... ... ...
8898011 B00M0RE7CS 2014-07-23 5 2014-07
8898012 B00M0RE7CS 2014-07-23 5 2014-07
8898013 B00M0RE7CS 2014-07-23 5 2014-07
8898014 B00M0RE7CS 2014-07-22 5 2014-07
8898015 B00M0RE7CS 2014-07-22 5 2014-07
8898016 B00M0RE7CS 2014-07-23 5 2014-07
8898017 B00M0RE7CS 2014-07-22 5 2014-07
8898018 B00M0RE7CS 2014-07-22 5 2014-07
8898019 B00M0RE7CS 2014-07-23 5 2014-07
8898020 B00M0RE7CS 2014-07-23 5 2014-07
8898021 B00M0RE7CS 2014-07-23 5 2014-07
8898022 B00M0RE7CS 2014-07-22 5 2014-07
8898023 B00M0RE7CS 2014-07-22 5 2014-07
8898024 B00M0RE7CS 2014-07-22 5 2014-07
8898025 B00M0RE7CS 2014-07-22 5 2014-07
8898026 B00M0RE7CS 2014-07-22 5 2014-07
8898027 B00M0RE7CS 2014-07-22 5 2014-07
8898028 B00M0RE7CS 2014-07-22 5 2014-07
8898029 B00M0RE7CS 2014-07-22 5 2014-07
8898030 B00M0RE7CS 2014-07-23 5 2014-07
8898031 B00M0RE7CS 2014-07-22 5 2014-07
8898032 B00M0RE7CS 2014-07-23 5 2014-07
8898033 B00M0RE7CS 2014-07-22 5 2014-07
8898034 B00M13FNSS 2014-07-23 4 2014-07
8898035 B00M13FNSS 2014-07-23 5 2014-07
8898036 B00M13FNSS 2014-07-23 5 2014-07
8898037 B00M13FNSS 2014-07-23 5 2014-07
8898038 B00M13FNSS 2014-07-23 5 2014-07
8898039 B00M13FNSS 2014-07-23 5 2014-07
8898040 B00M13FNSS 2014-07-23 5 2014-07

8898041 rows × 4 columns

In [3]:
# Save the data
ratings_df.to_csv("ratings.csv", index=False)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-f5c6d2b776b8> in <module>()
      1 # Save the data
----> 2 ratings_df.to_csv("ratings.csv", index=False)

NameError: name 'ratings_df' is not defined
In [195]:
# Read the data from the file
ratings_df = pd.read_csv("ratings.csv", parse_dates=['Date'])

Number of 1, 2, 3, 4, 5 - star ratings in each month

In [198]:
# Number of 1, 2, 3, 4, 5 - star ratings in each month
ratings_df['Year-month'] = ratings_df['Date'].dt.to_period('M')
ratings_grouped_df = ratings_df[['Year-month', 'rating', 'asin']].groupby(['Year-month', 'rating']).count()
ratings_grouped_unstack_date_df = ratings_grouped_df.unstack(level=1)
ratings_grouped_unstack_date_df.head()
Out[198]:
asin
rating 1 2 3 4 5
Year-month
1996-05 NaN NaN NaN 1.0 NaN
1996-06 1.0 NaN NaN NaN NaN
1996-08 NaN NaN NaN NaN 1.0
1996-09 NaN NaN NaN NaN 1.0
1996-11 NaN 2.0 NaN 2.0 2.0
In [5]:
ratings_grouped_unstack_date_df.plot(figsize = (20,10))
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3eef290ac8>

Number of each 1, 2, 3, 4, 5 ratings for each book

In [17]:
# Aggregate data by number of each 1, 2, 3, 4, 5 ratings for each book

ratings_grouped_asin_df = ratings_df[['Year-month', 'rating', 'asin']].groupby(['asin', 'rating']).count()
ratings_grouped_asin_unstack_df = ratings_grouped_asin_df.unstack(level=1, fill_value=0)
ratings_grouped_asin_unstack_df.columns = ratings_grouped_asin_unstack_df.columns.droplevel()
ratings_grouped_asin_unstack_df.head()
Out[17]:
rating 1 2 3 4 5
asin
000100039X 6 4 8 15 173
0001055178 0 4 2 10 2
0001473123 1 0 0 2 13
0001473727 0 0 0 0 7
0001473905 0 0 1 0 5
In [18]:
# Save aggregated data to file
ratings_grouped_asin_unstack_df.to_csv("ratings_grouped_unstack.csv")
In [242]:
# Read data from file
ratings_grouped_unstack_df = pd.read_csv('ratings_grouped_unstack.csv', index_col='asin')
ratings_grouped_unstack_df.rename(columns={1:"1", 2:"2", 3:"3", 4:"4", 5:"5"}, inplace=True)
ratings_grouped_unstack_df.head()
Out[242]:
1 2 3 4 5
asin
000100039X 6 4 8 15 173
0001055178 0 4 2 10 2
0001473123 1 0 0 2 13
0001473727 0 0 0 0 7
0001473905 0 0 1 0 5

Column 1 contains the number of 1-star ratings, etc.

Mean and standard deviation of ratings for each book

In [244]:
# Calculating weighted mean and standard deviation
def weighted_avg_and_std(values, weights):
    """
    Return the weighted average and standard deviation.

    values, weights -- Numpy ndarrays with the same shape.
    """
    average = np.average(values, weights=weights, axis=1)
    variance = np.average(((values.T - average).T)**2, weights=weights, axis=1)
    return (average, np.sqrt(variance))
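As a quick sanity check, the helper can be exercised on a single row of counts. Uniform weights over the values 1..5 should give a mean of 3 and a standard deviation of sqrt(2); the function is reproduced here so the sketch is self-contained:

```python
import numpy as np

def weighted_avg_and_std(values, weights):
    """Weighted average and standard deviation along axis 1 (as defined above)."""
    average = np.average(values, weights=weights, axis=1)
    variance = np.average(((values.T - average).T) ** 2, weights=weights, axis=1)
    return (average, np.sqrt(variance))

vals = np.array([[1, 2, 3, 4, 5]])
w = np.array([[1, 1, 1, 1, 1]])  # uniform weights, i.e. equal rating counts
mean, std = weighted_avg_and_std(vals, w)
```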
In [245]:
ratings_grouped_unstack_df['Weighted_mean'], ratings_grouped_unstack_df[
    'Std_dev'] = weighted_avg_and_std(
        values=np.broadcast_to([1, 2, 3, 4, 5],
                               (ratings_grouped_unstack_df.shape[0], 5)),
        weights=np.array(ratings_grouped_unstack_df[['1', '2', '3', '4',
                                                     '5']].values))

ratings_grouped_unstack_df['Number_of_reviews'] = np.sum(
    np.array(ratings_grouped_unstack_df[['1', '2', '3', '4', '5']].values),
    axis=1)

ratings_grouped_unstack_df.head()
Out[245]:
1 2 3 4 5 Weighted_mean Std_dev Number_of_reviews
asin
000100039X 6 4 8 15 173 4.674757 0.873584 206
0001055178 0 4 2 10 2 3.555556 0.955814 18
0001473123 1 0 0 2 13 4.625000 0.992157 16
0001473727 0 0 0 0 7 5.000000 0.000000 7
0001473905 0 0 1 0 5 4.666667 0.745356 6
In [248]:
ratings_grouped_unstack_df = pd.merge(
    ratings_grouped_unstack_df,
    avg_len_review_per_book_df,
    right_index=True,
    left_index=True)
ratings_grouped_unstack_df.drop('number_of_reviews', inplace=True, axis=1)
ratings_grouped_unstack_df.head()
Out[248]:
1 2 3 4 5 Weighted_mean Std_dev Number_of_reviews Avg_len
000100039X 6 4 8 15 173 4.674757 0.873584 206 638.033981
0001055178 0 4 2 10 2 3.555556 0.955814 18 1308.611111
0001473123 1 0 0 2 13 4.625000 0.992157 16 287.937500
0001473727 0 0 0 0 7 5.000000 0.000000 7 187.571429
0001473905 0 0 1 0 5 4.666667 0.745356 6 481.833333
In [241]:
ratings_grouped_unstack_df.describe()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-241-8bd14405bb63> in <module>()
----> 1 ratings_grouped_unstack_df.describe()

NameError: name 'ratings_grouped_unstack_df' is not defined

Plots

In [224]:
g = sns.pairplot(ratings_grouped_unstack_df)
g.set(xticklabels=[])
Out[224]:
<seaborn.axisgrid.PairGrid at 0x7fc5f0d3d278>
In [26]:
ratings_grouped_unstack_df.plot(y = 'Std_dev', kind = 'hist', bins = 40, figsize = (20,7), 
        title = 'Standard deviation of ratings for books')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3f2c600978>
In [27]:
ratings_grouped_unstack_df[ratings_grouped_unstack_df['Number_of_reviews'] > 9].plot(y = 'Std_dev', 
        kind = 'hist', bins = 40, figsize = (20,7), 
        title = 'Standard deviation of ratings for books with 10 or more reviews')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3ebdb39b38>
In [80]:
ratings_grouped_unstack_df.plot(y = 'Std_dev', kind = 'box')
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc608618198>
In [28]:
ratings_grouped_unstack_df.plot(x = 'Weighted_mean', y = 'Std_dev', kind = 'scatter', alpha = 0.3, figsize = (20,7))
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3ebd33e940>
In [29]:
ratings_grouped_unstack_df[ratings_grouped_unstack_df['Number_of_reviews'] > 9].plot(x = 'Weighted_mean', y = 'Std_dev', kind = 'scatter', alpha = 0.3, figsize = (20,7))
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3ebb03ecc0>
In [30]:
ratings_grouped_unstack_df.plot(x = 'Weighted_mean', y = 'Number_of_reviews', kind = 'scatter', alpha = 0.3, figsize = (15,7))
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3ebd2b00f0>
In [31]:
ratings_grouped_unstack_df.plot(x = 'Weighted_mean', y = 'Avg_len', kind = 'scatter', figsize = (20,7), alpha = 0.3)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3ebaf18780>
In [32]:
ratings_grouped_unstack_df[ratings_grouped_unstack_df['Number_of_reviews'] > 100].plot(x = 'Std_dev', y = 'Avg_len', kind = 'scatter', figsize = (20,7), alpha = 0.3)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3ebae416a0>

High vs. low variance test - do they have the same average review length?

In [36]:
books_low_std = ratings_grouped_unstack_df[ratings_grouped_unstack_df['Std_dev'] < 0.507]
books_high_std = ratings_grouped_unstack_df[ratings_grouped_unstack_df['Std_dev'] > 1.18]
print(books_low_std.shape[0], books_high_std.shape[0])
91945 73906
In [37]:
books_low_std.describe()
Out[37]:
1 2 3 4 5 Weighted_mean Std_dev Number_of_reviews Avg_len
count 91945.000000 91945.000000 91945.000000 91945.000000 91945.000000 91945.000000 91945.000000 91945.000000 91945.000000
mean 0.007320 0.005351 0.075186 2.132003 7.780184 4.743573 0.332847 10.000044 850.340227
std 0.221183 0.119972 0.441639 2.430834 13.635870 0.276237 0.181190 15.338557 570.802787
min 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 5.000000 63.000000
25% 0.000000 0.000000 0.000000 1.000000 4.000000 4.625000 0.300000 5.000000 430.387097
50% 0.000000 0.000000 0.000000 2.000000 5.000000 4.800000 0.400000 7.000000 710.833333
75% 0.000000 0.000000 0.000000 3.000000 8.000000 4.900000 0.471405 10.000000 1117.833333
max 20.000000 12.000000 27.000000 125.000000 1604.000000 5.000000 0.506812 1731.000000 8716.000000
In [38]:
books_high_std.describe()
Out[38]:
1 2 3 4 5 Weighted_mean Std_dev Number_of_reviews Avg_len
count 73906.000000 73906.000000 73906.000000 73906.000000 73906.000000 73906.000000 73906.000000 73906.000000 73906.000000
mean 2.817782 2.212107 3.098273 4.711634 9.610140 3.697671 1.390584 22.449936 901.369553
std 9.829752 7.157018 10.577516 15.644846 31.245404 0.472719 0.160412 69.236831 586.023271
min 0.000000 0.000000 0.000000 0.000000 0.000000 1.444444 1.180041 5.000000 103.000000
25% 1.000000 0.000000 0.000000 1.000000 3.000000 3.428571 1.259964 7.000000 477.461538
50% 1.000000 1.000000 1.000000 2.000000 5.000000 3.777778 1.356466 10.000000 781.400000
75% 3.000000 2.000000 3.000000 4.000000 8.000000 4.000000 1.490712 19.000000 1174.356061
max 1565.000000 640.000000 1209.000000 2065.000000 2959.000000 4.600000 2.000000 7440.000000 13459.500000
In [39]:
fig, ax1 = plt.subplots(figsize=(15, 4))
bp = sns.boxplot(data = [books_low_std['Avg_len'].values, books_high_std['Avg_len'].values], orient = 'h')
ax1.set_yticklabels(['Low std','High std'])
ax1.set_xlabel("Average review length")
Out[39]:
<matplotlib.text.Text at 0x7f3ebae006a0>
In [40]:
stats.ks_2samp(books_low_std['Avg_len'].values, books_high_std['Avg_len'].values)
Out[40]:
Ks_2sampResult(statistic=0.057000797657238644, pvalue=3.4237576578318584e-116)

We can reject the null hypothesis that the two samples come from the same distribution; together with the boxplots, this supports the claim that books with a high standard deviation of ratings have, on average, longer reviews than those with a low one.

In [45]:
fig, ax1 = plt.subplots(figsize=(15, 6))
bp = sns.boxplot(data = [books_low_std['Avg_len'].values, books_high_std['Avg_len'].values], orient = 'h', showfliers=False)
ax1.set_yticklabels(['Low std','High std'])
ax1.set_xlabel("Average review length")
Out[45]:
<matplotlib.text.Text at 0x7f3ebd07e320>
In [42]:
def draw_plot(data, offset,edge_color, fill_color):
    pos = np.arange(data.shape[1])+offset 
    bp = ax.boxplot(data, positions = pos, widths=0.3, patch_artist=True, manage_xticks=False)
#    for element in ['boxes', 'whiskers', 'fliers', 'medians', 'caps']:
#        plt.setp(bp[element], color=edge_color)
#    for patch in bp['boxes']:
#        patch.set(facecolor=fill_color)
In [44]:
fig, ax = plt.subplots(figsize = (20, 10))
draw_plot(books_low_std.ix[:, books_low_std.columns != "Avg_len"].values, -0.2, "tomato", "white")
draw_plot(books_high_std.ix[:, books_high_std.columns != "Avg_len"].values, +0.2,"skyblue", "white")
ax.set_xticks(np.arange(books_low_std.shape[1] - 1))
ax.set_xticklabels(books_low_std.columns)
ax.set_ylim([-5, 50])
#plt.savefig(__file__+'.png', bbox_inches='tight')
plt.show()
plt.close()

Popularity metrics

1st popularity measure

Overall popularity measure (#reviews / #days from first review)

In [235]:
# Overall popularity measure (#reviews / #days from first review)
ratings_gb_df = ratings_df[['asin', 'Date', 'rating']].groupby('asin').agg({
    'Date':
    np.min,
    'rating':
    'count'
})
ratings_gb_df.columns = ['first_review_date', 'number_of_reviews']
last_review = ratings_df.Date.max()
ratings_gb_df[
    'days_since_first_review'] = last_review - ratings_gb_df['first_review_date']
ratings_gb_df['days_since_first_review'] = ratings_gb_df[
    'days_since_first_review'].astype('timedelta64[D]')
ratings_gb_df['popularity_factor'] = ratings_gb_df[
    'number_of_reviews'] / ratings_gb_df['days_since_first_review']
popularity_measure1_df = ratings_gb_df.sort_values(
    'popularity_factor', ascending=False)
popularity_measure1_df.head()
Out[235]:
first_review_date number_of_reviews days_since_first_review popularity_factor
asin
B00M13FNSS 2014-07-23 7 0.0 inf
B00LZKMXBI 2014-07-21 160 2.0 80.000000
B00LBFVNQS 2014-07-22 32 1.0 32.000000
B00M0RE7CS 2014-07-22 30 1.0 30.000000
1471133524 2014-07-01 515 22.0 23.409091

2nd popularity measure

Computing a popularity measure equal to the biggest review increase over any 3-month period (the largest first difference of the accumulated number of reviews, with a delta of 3 months)
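The idea can be illustrated on a toy series (hypothetical data, not from the dataset): the metric is the maximum number of reviews gained in any 3 consecutive months, computed as the cumulative review count minus itself shifted by 3.

```python
import pandas as pd

# Hypothetical monthly review counts for one book
monthly_reviews = pd.Series(
    [1, 0, 2, 10, 25, 8, 3, 1],
    index=pd.period_range('2013-01', periods=8, freq='M'))

cumulative = monthly_reviews.cumsum()
window_gain = cumulative - cumulative.shift(3)  # reviews added over each 3-month window

print(window_gain.idxmax(), window_gain.max())  # → 2013-06 43.0
```

The peak window ending in 2013-06 covers April–June (10 + 25 + 8 = 43 reviews), matching the "biggest first derivative" description above.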

In [143]:
# DataFrame having all possible months 
first_review = ratings_df.Date.min()
first_review_month = first_review.to_period("M")

last_review = ratings_df.Date.max()
last_review_month = last_review.to_period("M")

month_year_timespan = pd.date_range(first_review_month.to_timestamp(), last_review_month.to_timestamp(), freq = 'M').to_period('M')
month_year_timespan_df = pd.DataFrame(month_year_timespan, columns = ['Year-month'])
In [162]:
def pop_measure_2(df, period):
    df = df.groupby(['asin', "Year-month"]).count()
    df = df.reset_index()

    df = pd.merge(
        df,
        month_year_timespan_df,
        how='right',
        right_on='Year-month',
        left_on="Year-month")
    df.fillna(0, inplace=True)
    df.sort_values("Year-month", inplace=True)

    df['cumsum_rating'] = df['rating'].cumsum()
    df['rating_shift'] = df['cumsum_rating'].shift(period)
    df['month_diff'] = df['cumsum_rating'] - df['rating_shift']
    return df[['asin', 'Year-month',
               'month_diff']].loc[df['month_diff'].idxmax()]
In [200]:
# Computing popularity measure equal to the biggest 3-month-period review increase
# (the biggest first derivative over accumulated number of reviews with delta equal to 3 months)
period = 3
ratings_gb_df = ratings_df[['asin', 'Year-month', 'rating']].groupby('asin')
popularity_measure2_df = pd.DataFrame()

for idx, book in enumerate(ratings_gb_df):
    temp = pd.DataFrame(book[1], copy=True)
    popularity_measure2_df = pd.concat([popularity_measure2_df, pop_measure_2(temp, period)])

asins = popularity_measure2_df.values[::3].flatten()
ym = popularity_measure2_df.values[1::3].flatten()
max_diff = popularity_measure2_df.values[2::3].flatten()
popularity_measure2_df = pd.DataFrame.from_dict({'asin': asins, 'Year-month': ym, 'month_diff': max_diff})
Another thousand books added 187000
... (progress output for every thousandth book, up to 367000, truncated)
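A note on the loop above: calling `pd.concat` once per book is quadratic, because each iteration copies the whole accumulated frame. A sketch of an equivalent computation using `groupby().apply()` (assuming the same `ratings_df` layout with columns `asin`, `Year-month`, `rating`; the toy frame below is hypothetical):

```python
import pandas as pd

def pop_measure_2_fast(ratings_df, period=3):
    """Per-asin peak `period`-month review gain, one pass per book."""
    # Full calendar range, so months with zero reviews still count in windows
    full_months = pd.period_range(ratings_df['Year-month'].min(),
                                  ratings_df['Year-month'].max(), freq='M')

    def per_book(group):
        monthly = (group.groupby('Year-month')['rating'].count()
                        .reindex(full_months, fill_value=0))
        cumulative = monthly.cumsum()
        gain = cumulative - cumulative.shift(period)
        return pd.Series({'Year-month': gain.idxmax(),
                          'month_diff': gain.max()})

    return ratings_df.groupby('asin').apply(per_book).reset_index()

# Hypothetical toy data: book A has reviews in Jan, Feb (x2) and Jun,
# book B in Mar and Apr
toy = pd.DataFrame({
    'asin': ['A'] * 4 + ['B'] * 2,
    'Year-month': pd.PeriodIndex(['2013-01', '2013-02', '2013-02', '2013-06',
                                  '2013-03', '2013-04'], freq='M'),
    'rating': [5, 4, 3, 5, 1, 4],
})
res = pop_measure_2_fast(toy)
print(res)
```

This also avoids the reshape step (`values[::3]`, `values[1::3]`, ...) needed to rebuild columns from the concatenated Series.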
In [204]:
popularity_measure2_df = pd.read_csv("popularity_measure_2_part1.csv")
popularity_measure2_df = pd.concat([popularity_measure2_df, pd.read_csv("popularity_measure_2_part2.csv")])
popularity_measure2_df.reset_index(drop=True, inplace=True)

# Sanity check: which books ended up without a 2nd popularity value?
# Placeholder rows with asin == '0' arise for books on the market for less
# than 3 months (first review within the last 3 months), which have no
# valid 3-month window.
unique_asin = set(ratings_df['asin'].values)
popularity_asins = set(popularity_measure2_df['asin'].values)
missing_asins = list(unique_asin - popularity_asins)

popularity_measure2_df = popularity_measure2_df.drop(popularity_measure2_df[popularity_measure2_df['asin'] == '0'].index)
popularity_measure2_df.to_csv("popularity_measure_2.csv", index = False)
Out[204]:
Year-month asin month_diff
0 2014-01 000100039X 19.0
1 2000-03 0001055178 4.0
2 2013-02 0001473123 4.0
3 2014-03 0001473727 3.0
4 2013-02 0001473905 3.0
5 2013-01 0001712772 3.0
6 2013-08 000171287X 3.0
7 2001-01 0001714538 1.0
8 2003-09 0002005395 2.0
9 2013-02 0002006715 2.0
10 2009-12 0002007649 3.0
11 2011-05 0002007770 280.0
12 2014-04 0002008572 15.0
13 2014-05 0002051850 19.0
14 2004-05 0002111306 2.0
15 2014-03 0002113570 5.0
16 2000-05 0002117088 5.0
17 2003-04 0002153211 1.0
18 2000-06 0002154129 1.0
19 2001-11 0002154463 1.0
20 2014-03 000215725X 5.0
21 2001-03 0002157330 1.0
22 2013-07 0002158388 3.0
23 2000-06 000215949X 2.0
24 2012-05 0002178559 2.0
25 2004-07 0002179911 1.0
26 2013-11 0002185385 6.0
27 2006-04 000221685X 2.0
28 2011-02 0002216973 2.0
29 2000-04 000221735X 1.0
... ... ... ...
181809 1996-08 0 0.0
181810 1996-08 0 0.0
... (identical placeholder rows with asin '0') ...
181838 1996-08 0 0.0

367983 rows × 3 columns
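An alternative to the set-difference check above, sketched on hypothetical toy frames: an outer or left merge with `indicator=True` flags rows present in only one frame, which directly lists the asins lacking a popularity value.

```python
import pandas as pd

# Hypothetical frames standing in for ratings_df / popularity_measure2_df
left = pd.DataFrame({'asin': ['a', 'b', 'c']})
right = pd.DataFrame({'asin': ['a', 'c'], 'month_diff': [3.0, 5.0]})

merged = left.merge(right, on='asin', how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'asin']
print(list(missing))  # → ['b']
```

The `_merge` column takes the values `'both'`, `'left_only'`, or `'right_only'`, so the same merge also reveals orphan rows on the popularity side.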

In [ ]:
popularity_measure2_df = pd.read_csv("popularity_measure_2.csv")

Complete popularity DataFrame

In [ ]:
popularity_measure_df = pd.merge(
    popularity_measure1_df,
    popularity_measure2_df,
    how='left',
    left_index=True,
    right_on='asin')
popularity_measure_df.reset_index(inplace=True)

All statistics gathered in one DataFrame

Statistics for each book:

  • Amazon Standard Identification Number (asin)
  • Number of 1, 2, 3, 4, 5 - star ratings
  • Mean of ratings
  • Standard deviation of ratings
  • Number of reviews
  • Average review length
  • Date of the first review
  • Number of days since the first review
  • 1st popularity metric
  • 2nd popularity metric
  • Date of the highest value of 2nd popularity metric
In [251]:
complete_books_df = pd.merge(ratings_grouped_unstack_df, popularity_measure_df, how='left', left_index=True, right_on='asin')
complete_books_df.drop('index', axis = 1, inplace=True)
complete_books_df.reset_index(inplace=True, drop=True)
In [253]:
complete_books_df = complete_books_df.rename(
    index=str,
    columns={
        "popularity_factor": "pop_metric_1",
        "month_diff": "pop_metric_2",
        "Weighted_mean": "mean",
        "Year-month": "pop_metric_2_date"
    })
complete_books_df['pop_metric_2_date'] = pd.to_datetime(
    complete_books_df['pop_metric_2_date']).dt.to_period("M")
complete_books_df.drop("Number_of_reviews", axis = 1, inplace=True)
complete_books_df
Out[253]:
1 2 3 4 5 mean Std_dev Number_of_reviews Avg_len first_review_date number_of_reviews days_since_first_review pop_metric_1 pop_metric_2_date asin pop_metric_2
0 6 4 8 15 173 4.674757 0.873584 206 638.033981 2000-02-20 206 5267.0 0.039111 2014-01 000100039X 19.0
1 0 4 2 10 2 3.555556 0.955814 18 1308.611111 1998-12-13 18 5701.0 0.003157 2000-03 0001055178 4.0
2 1 0 0 2 13 4.625000 0.992157 16 287.937500 2009-11-27 16 1699.0 0.009417 2013-02 0001473123 4.0
3 0 0 0 0 7 5.000000 0.000000 7 187.571429 2011-06-02 7 1147.0 0.006103 2014-03 0001473727 3.0
4 0 0 1 0 5 4.666667 0.745356 6 481.833333 2009-04-19 6 1921.0 0.003123 2013-02 0001473905 3.0
5 1 0 0 1 13 4.666667 1.011050 15 334.266667 2000-06-17 15 5149.0 0.002913 2013-01 0001712772 3.0
6 0 0 2 1 9 4.583333 0.759203 12 409.333333 2001-10-01 12 4678.0 0.002565 2013-08 000171287X 3.0
7 1 1 0 2 3 3.714286 1.484615 7 450.000000 2001-01-22 7 4930.0 0.001420 2001-01 0001714538 1.0
8 0 0 5 3 6 4.071429 0.883523 14 1468.642857 2003-08-27 14 3983.0 0.003515 2003-09 0002005395 2.0
9 0 0 1 0 4 4.600000 0.800000 5 432.400000 2011-01-21 5 1279.0 0.003909 2013-02 0002006715 2.0
10 0 4 3 1 1 2.888889 0.993808 9 2616.000000 2008-12-10 9 2051.0 0.004388 2009-12 0002007649 3.0
11 79 133 227 593 1689 4.352444 1.018211 2721 591.721793 2006-05-25 2721 2981.0 0.912781 2011-05 0002007770 280.0
12 0 0 0 4 13 4.764706 0.424183 17 251.705882 2013-12-26 17 209.0 0.081340 2014-04 0002008572 15.0
13 4 13 19 53 175 4.446970 0.931880 264 1371.348485 1997-10-13 264 6127.0 0.043088 2014-05 0002051850 19.0
14 0 1 1 1 4 4.142857 1.124858 7 796.285714 2004-03-14 7 3783.0 0.001850 2004-05 0002111306 2.0
15 0 0 0 5 29 4.852941 0.354165 34 734.735294 2001-01-29 34 4923.0 0.006906 2014-03 0002113570 5.0
16 0 1 0 0 13 4.785714 0.772618 14 666.714286 2000-04-05 14 5222.0 0.002681 2000-05 0002117088 5.0
17 0 0 1 1 3 4.400000 0.800000 5 535.800000 2003-04-07 5 4125.0 0.001212 2003-04 0002153211 1.0
18 0 0 0 1 5 4.833333 0.372678 6 826.833333 2000-06-22 6 5144.0 0.001166 2000-06 0002154129 1.0
19 0 2 0 0 4 4.000000 1.414214 6 850.666667 2001-11-04 6 4644.0 0.001292 2001-11 0002154463 1.0
20 1 0 2 10 27 4.550000 0.804674 40 1281.775000 1999-03-10 40 5614.0 0.007125 2014-03 000215725X 5.0
21 0 0 1 2 2 4.200000 0.748331 5 1960.800000 2001-03-31 5 4862.0 0.001028 2001-03 0002157330 1.0
22 0 0 0 4 1 4.200000 0.400000 5 510.400000 2008-02-18 5 2347.0 0.002130 2013-07 0002158388 3.0
23 0 0 1 2 14 4.764706 0.545507 17 501.705882 2000-05-19 17 5178.0 0.003283 2000-06 000215949X 2.0
24 3 0 1 1 2 2.857143 1.726149 7 3127.857143 2000-10-08 7 5036.0 0.001390 2012-05 0002178559 2.0
25 0 0 0 3 2 4.400000 0.489898 5 1835.800000 2004-07-16 5 3659.0 0.001366 2004-07 0002179911 1.0
26 0 0 4 8 30 4.619048 0.652919 42 500.833333 2000-09-20 42 5054.0 0.008310 2013-11 0002185385 6.0
27 0 0 0 3 6 4.666667 0.471405 9 804.000000 1999-08-31 9 5440.0 0.001654 2006-04 000221685X 2.0
28 0 0 0 2 10 4.833333 0.372678 12 841.166667 2000-01-24 12 5294.0 0.002267 2011-02 0002216973 2.0
29 1 0 1 1 2 3.600000 1.496663 5 712.200000 2000-04-14 5 5213.0 0.000959 2000-04 000221735X 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
367953 0 0 0 3 6 4.666667 0.471405 9 400.444444 2014-07-18 9 5.0 1.800000 NaN B00LXARZCC NaN
367954 0 0 0 0 14 5.000000 0.000000 14 530.071429 2014-07-18 14 5.0 2.800000 NaN B00LXK4W2S NaN
367955 0 0 0 3 2 4.400000 0.489898 5 686.800000 2014-07-18 5 5.0 1.000000 NaN B00LXLA7WQ NaN
367956 0 0 2 7 5 4.214286 0.673856 14 1722.000000 2014-07-18 14 5.0 2.800000 NaN B00LXLACZS NaN
367957 0 0 0 0 8 5.000000 0.000000 8 572.000000 2014-07-18 8 5.0 1.600000 NaN B00LXM1MRY NaN
367958 0 1 1 5 6 4.230769 0.890449 13 535.923077 2014-07-19 13 4.0 3.250000 NaN B00LXNM46G NaN
367959 0 0 1 5 19 4.720000 0.530660 25 707.840000 2014-07-19 25 4.0 6.250000 NaN B00LXQIL06 NaN
367960 0 0 0 1 6 4.857143 0.349927 7 331.714286 2014-07-19 7 4.0 1.750000 NaN B00LY2DTG0 NaN
367961 0 0 2 3 82 4.919540 0.346357 87 611.758621 2014-07-19 87 4.0 21.750000 NaN B00LY51TPA NaN
367962 0 0 0 2 8 4.800000 0.400000 10 698.400000 2014-07-20 10 3.0 3.333333 NaN B00LYGQXQY NaN
367963 0 0 0 0 28 5.000000 0.000000 28 427.178571 2014-07-19 28 4.0 7.000000 NaN B00LYH0JD6 NaN
367964 0 0 0 2 22 4.916667 0.276385 24 1108.375000 2014-07-19 24 4.0 6.000000 NaN B00LYLHUNE NaN
367965 0 0 0 0 5 5.000000 0.000000 5 373.400000 2014-07-20 5 3.0 1.666667 NaN B00LYM5AUI NaN
367966 0 0 0 1 4 4.800000 0.400000 5 383.000000 2014-07-21 5 2.0 2.500000 NaN B00LYP00SW NaN
367967 1 0 1 1 17 4.650000 0.963068 20 661.650000 2014-07-20 20 3.0 6.666667 NaN B00LYPIQRY NaN
367968 0 0 0 0 7 5.000000 0.000000 7 318.285714 2014-07-20 7 3.0 2.333333 NaN B00LYPZIXO NaN
367969 0 0 0 4 34 4.894737 0.306892 38 1223.894737 2014-07-20 38 3.0 12.666667 NaN B00LYPZJ4M NaN
367970 0 0 0 0 5 5.000000 0.000000 5 352.400000 2014-07-21 5 2.0 2.500000 NaN B00LYXRD0C NaN
367971 0 0 0 0 7 5.000000 0.000000 7 870.285714 2014-07-20 7 3.0 2.333333 NaN B00LZ0ZKAE NaN
367972 0 0 0 10 34 4.772727 0.419070 44 1319.295455 2014-07-20 44 3.0 14.666667 NaN B00LZ0ZL1C NaN
367973 1 0 0 3 21 4.720000 0.825591 25 1185.960000 2014-07-20 25 3.0 8.333333 NaN B00LZ19FE0 NaN
367974 0 0 0 5 60 4.923077 0.266469 65 767.538462 2014-07-20 65 3.0 21.666667 NaN B00LZ4D0UC NaN
367975 0 0 0 1 4 4.800000 0.400000 5 570.800000 2014-07-21 5 2.0 2.500000 NaN B00LZ5SRTK NaN
367976 0 0 0 0 5 5.000000 0.000000 5 1871.000000 2014-07-21 5 2.0 2.500000 NaN B00LZ83VIO NaN
367977 0 0 0 0 7 5.000000 0.000000 7 865.571429 2014-07-20 7 3.0 2.333333 NaN B00LZ9OBWI NaN
367978 0 0 1 2 13 4.750000 0.559017 16 1872.875000 2014-07-21 16 2.0 8.000000 NaN B00LZFHL7Y NaN
367979 0 1 4 14 141 4.843750 0.468333 160 652.237500 2014-07-21 160 2.0 80.000000 NaN B00LZKMXBI NaN
367980 0 0 0 1 12 4.923077 0.266469 13 1431.923077 2014-07-21 13 2.0 6.500000 NaN B00M029T4O NaN
367981 0 0 0 1 29 4.966667 0.179505 30 781.033333 2014-07-22 30 1.0 30.000000 NaN B00M0RE7CS NaN
367982 0 0 0 1 6 4.857143 0.349927 7 301.142857 2014-07-23 7 0.0 inf NaN B00M13FNSS NaN

367983 rows × 16 columns

In [29]:
# Save statistics data to file
complete_books_df.to_csv("books_statistics_df.csv", index=False)

Plots

In [18]:
# Read statistics data from file
complete_books_df = pd.read_csv(
    "books_statistics_df.csv",
    parse_dates=['first_review_date', 'pop_metric_2_date'])

complete_books_df['pop_metric_2_date'] = pd.to_datetime(
    complete_books_df['pop_metric_2_date']).dt.to_period("M")
In [8]:
complete_books_df.plot(x = 'mean', y = 'pop_metric_1', kind = 'scatter', alpha = 0.3, figsize=(20,7))
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9f46441320>
In [10]:
complete_books_df[complete_books_df['days_since_first_review'] > 30].plot(
    x='mean', y='pop_metric_1', kind='scatter', alpha=0.3, figsize=(20, 7))
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9f46ffc0f0>
In [15]:
complete_books_df.plot(
    x='mean', y='pop_metric_2', kind='scatter', alpha=0.3, figsize=(20, 10))
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9f4330d1d0>

Categories

Reviews covered by books with categories

In [35]:
books_with_categories = pd.read_json("product_categories_enhanced.json", lines = True)
books_with_categories = books_with_categories.set_index('asin', drop = True)
books_with_categories.head()
Out[35]:
author brand categories keywords publisher release_date title
asin
030758836X Gillian Flynn Crown [[Books, Mystery, Thriller & Suspense, Thrille... [mystery fiction, wives, 030758836x, thrillers... Crown 2012-06-05 Gone Girl
0439023483 Suzanne Collins Scholastic Press [[Books, Children's Books, Science Fiction & F... [science fiction, scholastic press, survival, ... Scholastic Press 2008-09-14 The Hunger Games (The Hunger Games, Book 1)
0375831002 Markus Zusak Alfred A. Knopf [[Books, Children's Books, Literature & Fictio... [holocaust, the book thief, historical, orphan... Knopf Books for Young Readers 2006-03-14 The Book Thief
038536315X John Grisham John Grisham [[Books, Mystery, Thriller & Suspense, Thrille... [john grisham, political, suspense fiction, ra... Random House Large Print 2013-10-22 Sycamore Row
0439023513 Suzanne Collins Scholastic Press [[Books, Teens, Literature & Fiction, Social &... [science fiction, scholastic press, survival, ... Scholastic Press 2010-08-24 Mockingjay (The Hunger Games)
In [36]:
books_with_categories = pd.merge(
    books_with_categories,
    complete_books_df,
    left_index=True,
    right_on='asin',
    how='inner')
In [38]:
books_with_categories.sort_values('number_of_reviews', ascending=False, inplace=True)
In [39]:
books_with_categories
Out[39]:
author brand categories keywords publisher release_date title 1 2 3 ... mean Std_dev Avg_len first_review_date number_of_reviews days_since_first_review pop_metric_1 pop_metric_2_date asin pop_metric_2
29335 Gillian Flynn Crown [[Books, Mystery, Thriller & Suspense, Thrille... [mystery fiction, wives, 030758836x, thrillers... Crown 2012-06-05 Gone Girl 567 640 1209 ... 3.834543 1.250125 455.739785 2012-03-25 7440 850.0 8.752941 2013-03 030758836X 1564.0
68268 Suzanne Collins Scholastic Press [[Books, Children's Books, Science Fiction & F... [science fiction, scholastic press, survival, ... Scholastic Press 2008-09-14 The Hunger Games (The Hunger Games, Book 1) 129 134 344 ... 4.571684 0.835131 709.171952 2008-07-22 6717 2192.0 3.064325 2013-02 0439023483 981.0
54647 Markus Zusak Alfred A. Knopf [[Books, Children's Books, Literature & Fictio... [holocaust, the book thief, historical, orphan... Knopf Books for Young Readers 2006-03-14 The Book Thief 81 109 274 ... 4.588199 0.830168 482.175576 2006-02-16 4864 3079.0 1.579734 2014-02 0375831002 1882.0
58019 John Grisham John Grisham [[Books, Mystery, Thriller & Suspense, Thrille... [john grisham, political, suspense fiction, ra... Random House Large Print 2013-10-22 Sycamore Row 50 91 360 ... 4.501955 0.809504 276.146612 2013-10-22 4604 274.0 16.802920 2014-01 038536315X 2566.0
68270 Suzanne Collins Scholastic Press [[Books, Teens, Literature & Fiction, Social &... [science fiction, scholastic press, survival, ... Scholastic Press 2010-08-24 Mockingjay (The Hunger Games) 184 298 607 ... 4.131081 1.134602 720.243018 2010-03-03 4440 1603.0 2.769807 2013-03 0439023513 793.0
39571 Donna Tartt Little Brown and Company [[Books, Literature & Fiction, Literary], [Boo... [realization, fiction, 0316055433, donna tartt... Little, Brown and Company 2013-10-22 The Goldfinch: A Novel (Pulitzer Prize for Fic... 327 457 794 ... 3.787456 1.288250 493.768873 2013-09-24 4305 302.0 14.254967 2014-03 0316055433 1775.0
59290 Dan Brown Dan Brown [[Books, Literature & Fiction, Genre Fiction, ... [dan brown, dante alighieri, suspense fiction,... Doubleday 2013-05-14 Inferno 199 388 869 ... 3.866013 1.163871 432.973856 2013-05-14 4284 435.0 9.848276 2013-08 0385537859 2255.0
593 Veronica Roth imusti [[Books, Science Fiction & Fantasy, Science Fi... [harpercollins children's books, allegiant (di... HarperCollins Children's Books 2012-11-25 Allegiant (Divergent Trilogy) 593 450 644 ... 3.490709 1.461977 665.826485 2013-10-22 3821 274.0 13.945255 2013-12 0007444117 1627.0
209745 Colleen Hoover Atria Books [[Books, Literature & Fiction, Erotica, Romant... [atria books, 147674355x, fiction, new adult, ... Atria Books 2013-05-07 Hopeless 59 75 148 ... 4.660403 0.783968 499.815302 2012-12-18 3725 582.0 6.400344 2013-02 147674355X 2128.0
63687 Liane Moriarty Amy Einhorn Books [[Books, Literature & Fiction, World Literatur... [family life, psychological, contemporary wome... Berkley 2013-07-30 The Husband's Secret 85 138 405 ... 4.242134 0.954321 329.169357 2013-03-21 3655 489.0 7.474438 2013-11 0399159347 1184.0
46426 E L James Random House [[Books, Literature & Fiction, Women's Fiction... [e l james, vintage books, contemporary women,... Vintage Books 2012-04-03 Fifty Shades of Grey: Book One of the Fifty Sh... 1565 600 350 ... 2.419686 1.583406 811.590585 2006-11-03 3505 2819.0 1.243349 2012-07 0345803485 1168.0
18292 Kathryn Stockett imusti [[Books, Literature & Fiction, Literary]] [penguin canada, the help, 0141039280, kathryn... Penguin Canada 2010-11-25 The Help 73 63 158 ... 4.638825 0.826388 665.756912 2009-01-18 3472 2012.0 1.725646 2011-09 0141039280 444.0
208042 Hugh Howey [[Books, Literature & Fiction, Action & Advent... [science fiction, wool, action & adventure, dy... CreateSpace Independent Publishing Platform 2012-01-27 Wool - Omnibus Edition 47 62 171 ... 4.605263 0.796609 588.305882 2012-02-03 3230 901.0 3.584906 2013-05 1469984202 508.0
40516 Robert Galbraith Robert Galbraith [[Books, Mystery, Thriller & Suspense, Mystery... [, crime, mystery fiction, afghan war, robert ... Mulholland Books 2013-04-30 The Cuckoo's Calling (A Cormoran Strike Novel) 131 213 541 ... 3.970093 1.092759 411.543614 2013-04-15 3210 464.0 6.918103 2013-09 0316206849 1831.0
552 Laura Hillenbrand [[Books, History, Military, World War II]] [0007386648, unbroken, fourth estate, laura hi... Fourth Estate 2010-11-25 Unbroken 31 48 119 ... 4.724606 0.694862 554.956782 2010-09-24 3170 1398.0 2.267525 2014-05 0007386648 429.0
46427 E. L. James Random House [[Books, Literature & Fiction, Women's Fiction... [e l james, contemporary women, fiction, roman... Vintage 2012-04-17 Fifty Shades Darker 206 189 320 ... 4.184591 1.230284 385.346067 2011-09-11 3115 1046.0 2.978011 2013-02 0345803493 704.0
46428 E L James Random House [[Books, Literature & Fiction, Women's Fiction... [e l james, contemporary women, fiction, roman... Vintage 2012-04-17 Fifty Shades Freed: Book Three of the Fifty Sh... 229 177 335 ... 4.170613 1.262788 393.124554 2012-01-18 3083 917.0 3.362050 2013-02 0345803507 708.0
30115 John Grisham [[Books, Books on CD, Authors, A-Z, ( G ), Gri... [john grisham, the racketeer, fiction, suspens... Random House Audio 2012-10-23 The Racketeer 71 161 435 ... 4.161069 1.005760 285.959244 2012-10-23 3067 638.0 4.807210 2013-01 0307943232 1367.0
149831 Todd Burpo HarperCollins Christian Pub. [[Books, Christian Books & Bibles, Christian L... [christianity, religion & spirituality, religi... Thomas Nelson 2014-03-11 Heaven is for Real Movie Edition: A Little Boy... 201 108 232 ... 4.308034 1.190771 607.441709 2010-09-21 2925 1401.0 2.087794 2014-06 0849922070 753.0
150398 M. L. Stedman [[Books, Literature & Fiction, Contemporary], ... [the light between oceans, general & literary ... Transworld Publishers 2012-04-01 The Light Between Oceans 29 80 275 ... 4.381432 0.855455 382.428571 2012-04-14 2779 830.0 3.348193 2013-08 0857521012 639.0
11 Sara Gruen [[Books, Literature & Fiction, World Literatur... [sara gruen, water for elephants, g0002007770i... Harper Collins 2006-11-25 Water For Elephants 79 133 227 ... 4.352444 1.018211 591.721793 2006-05-25 2721 2981.0 0.912781 2011-05 0002007770 280.0
9556 Christina Baker Kline Christina Baker Kline [[Books, Literature & Fiction, Women's Fiction... [general & literary fiction, coming of age, fi... William Morrow 2013-04-02 Orphan Train 5 21 170 ... 4.618741 0.659883 354.702200 2013-02-01 2636 537.0 4.908752 2014-05 0061950726 1024.0
68041 Sylvia Day Penguin Group Usa [[Books, Romance, Holidays], [Books, Literatur... [adult & contemporary romance, fiction, book 3... Berkley 2013-06-04 Entwined with You (Crossfire, Book 3) 296 218 338 ... 3.865803 1.399412 528.681281 2013-06-04 2623 414.0 6.335749 2013-08 0425263924 2200.0
37502 Orson Scott Card Orson Scott Card [[Books, Science Fiction & Fantasy, Science Fi... [orson scott card, science fiction, ender (fic... Tor Books 1992-08-15 Ender's Game (The Ender Quintet) 67 76 225 ... 4.434382 0.949670 693.949348 1996-12-16 2606 6428.0 0.405414 2013-12 0312853238 557.0
198791 Eben Alexander M.D. Brand: Simon Schuster Audio [[Books, Religion & Spirituality, Occult & Par... [biography: science, simon & schuster audio, p... Simon & Schuster Audio 2012-10-23 Proof of Heaven: A Neurosurgeon's Near-Death E... 155 159 348 ... 4.077907 1.196788 532.924806 2012-10-23 2580 638.0 4.043887 2013-02 1442359315 814.0
198793 Jamie McGuire Brand: Simon n Schuster Audio [[Books, Books on CD, Romance], [Books, Books ... [144235948x, contemporary women, romance: mode... Simon & Schuster Audio 2012-08-14 Beautiful Disaster 119 130 190 ... 4.338950 1.112731 766.863636 2011-05-30 2552 1150.0 2.219130 2013-04 144235948X 427.0
198807 Stephen King Brand: Simon n Schuster Audio [[Books, Books on CD, Authors, A-Z, ( K ), Kin... [fiction, thrillers, horror & ghost stories, s... Simon & Schuster Audio 2013-09-24 Doctor Sleep: A Novel 46 62 139 ... 4.531051 0.852959 552.278264 2013-09-24 2512 302.0 8.317881 2013-12 1442362383 1418.0
20630 Stieg Larsson [[Books, Mystery, Thriller & Suspense, Thrille... [fiction, 0143170090, general, penguin canada,... Penguin Canada 2009-06-23 The Girl with the Dragon Tattoo 201 210 322 ... 3.895589 1.277243 881.389316 2008-01-14 2471 2382.0 1.037364 2010-09 0143170090 295.0
283907 A.G. Riddle Brand: Modern Mythology [[Books, Mystery, Thriller & Suspense, Thrille... [science fiction, suspense fiction, riddle inc... Riddle Inc. 2013-04-05 The Atlantis Gene: A Thriller (The Origin Myst... 84 116 277 ... 4.145902 1.024707 353.315574 2013-04-15 2440 464.0 5.258621 2014-02 1940026016 763.0
46241 William Landay Bantam [[Books, Mystery, Thriller & Suspense, Thrille... [bantam, thrillers, public prosecutors, myster... Bantam 2013-09-03 Defending Jacob: A Novel 48 125 272 ... 4.234227 0.974935 433.345567 2011-12-21 2425 945.0 2.566138 2013-06 0345533666 539.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
288246 Maia Underwood [[Kindle Store, Kindle eBooks, Literature & Fi... [romance, surviving passion (the shattered wor... 2010-02-07 Surviving Passion (The Shattered World Book 1) 2 2 9 ... 4.492424 0.811809 583.219697 2010-02-09 132 1625.0 0.081231 2011-06 B0037KMFJY 32.0
90989 Lisa Gardner Bantam [[Books, Deals in Books], [Books, Mystery, Thr... [0553588095, lisa gardner, psychological, miss... Bantam 2009-05-19 Say Goodbye: An FBI Profiler Novel 11 16 20 ... 3.765152 1.319205 762.431818 2008-07-16 132 2198.0 0.060055 2008-09 0553588095 38.0
43809 Bryce Courtenay Ballantine Books [[Books, Literature & Fiction, World Literatur... [the power of one: a novel, tv tie, 1939, 0345... Ballantine Books 1996-09-29 The Power of One: A Novel 3 3 9 ... 4.507576 0.891850 919.492424 1999-02-01 132 5651.0 0.023359 2013-08 034541005X 13.0
80167 Don Felder Don Felder [[Books, Arts & Photography, Music, Biographie... [biography and autobiography, biography, biogr... Wiley 2008-04-01 Heaven and Hell: My Life in the Eagles (1974-2... 4 2 14 ... 4.348485 0.945362 1195.272727 2008-04-13 132 2292.0 0.057592 2013-06 0470289066 18.0
222947 Samantha Young [[Books, Romance, Contemporary]] [romance, contemporary, createspace independen... CreateSpace Independent Publishing Platform 2014-04-02 Out of the Shallows (Into the Deep #2) (Volume 2) 2 13 15 ... 4.053030 1.024801 627.651515 2014-04-10 132 104.0 1.269231 2014-06 149744814X 128.0
187202 Cristin Terrill imusti [] [bloomsbury childrens books, 1408835193, all o... Bloomsbury Childrens Books 2013-11-25 All Our Yesterdays 1 5 8 ... 4.431818 0.827493 2064.484848 2013-08-01 132 356.0 0.370787 2013-10 1408835193 75.0
45339 Nikki Turner Turner, Nikki [[Books, Literature & Fiction, Genre Fiction, ... [african american, contemporary women, fiction... One World 2007-04-10 Forever a Hustler's Wife: A Novel (Nikki Turne... 9 9 27 ... 3.856061 1.213139 509.833333 2003-04-10 132 4122.0 0.032023 2007-06 0345493850 26.0
207150 Marie Force Brand: CreateSpace Independent Publishing Plat... [[Books, Romance, Contemporary]] [fiction, romance: modern, 1467938378, book 3,... CreateSpace Independent Publishing Platform 2011-12-18 Starting Over: Treading Water Series, Book 3 0 0 2 ... 4.878788 0.369896 378.712121 2011-12-18 132 948.0 0.139241 2013-04 1467938378 32.0
181019 C. J. Lyons PowerbookMedic [[Books, Mystery, Thriller & Suspense, Thrille... [serial murderers, c j lyons, american mystery... Minotaur Books 2012-07-31 Blind Faith: A Caitlyn Tierney FBI Thriller (S... 13 11 12 ... 4.022727 1.373261 441.984848 2010-10-07 132 1385.0 0.095307 2011-10 1250014603 50.0
150019 Richard Lee HarperCollins Christian Pub. [[Books, History, World, Religious, Religion, ... [religion and politics, united states, church ... Thomas Nelson 2012-01-30 The Coming Revolution: Signs from America's Pa... 6 16 18 ... 3.863636 1.179193 1856.674242 2012-01-13 132 922.0 0.143167 2012-04 0849948290 53.0
201488 John Ringo Brand: Baen [[Books, Science Fiction & Fantasy, Science Fi... [marines, adventure fiction, science fiction, ... Baen 2013-09-03 Under a Graveyard Sky 2 5 10 ... 4.454545 0.899188 637.318182 2013-08-23 132 334.0 0.395210 2013-11 1451639198 58.0
77545 Rob Thurman Roc [[Books, Literature & Fiction, Genre Fiction, ... [vampires, brothers, 0451460758, nightlife (ca... Ace 2006-03-07 Nightlife (Cal Leandros) 6 12 20 ... 4.000000 1.180652 1266.340909 2006-03-05 132 3062.0 0.043109 2006-05 0451460758 16.0
64737 Paul F. Wilson [[Books, Comics & Graphic Novels, Graphic Nove... [the keep, paul f wilson, berkley, 0425064409] Berkley 1983-11-01 The Keep 2 6 22 ... 4.166667 0.962688 988.121212 1998-05-18 132 5910.0 0.022335 2014-01 0425064409 17.0
18117 Robert Louis Stevenson Brand: Penguin Books, Limited (UK) [[Books, Textbooks, Humanities, Literature], [... [dr jekyll and mr hyde (penguin classics), 014... Penguin Books, Limited (UK) 1998-03-31 Dr Jekyll and MR Hyde (Penguin Classics) 4 2 15 ... 4.181818 0.911361 1095.681818 1998-04-06 132 5952.0 0.022177 2012-03 0140620516 9.0
197621 John Locke Brand: iUniverse [[Books, Mystery, Thriller & Suspense, Thrille... [extortion, action & adventure, espionage, fic... iUniverse 2009-12-11 Lethal Experiment: A Donovan Creed Novel 3 11 7 ... 4.287879 1.041054 539.409091 2010-03-24 132 1582.0 0.083439 2011-01 1440196249 34.0
20530 Robert Greene Robert Greene [[Books, Business & Money, Business Culture, M... [robert greene, assertiveness, successful peop... Penguin Books 2013-10-29 Mastery 5 7 6 ... 4.356061 1.045427 1272.916667 2012-09-28 132 663.0 0.199095 2013-01 014312417X 41.0
184240 Rorke Denver Brand: Hyperion [[Books, Biographies & Memoirs, Historical, Mi... [hachette books, rorke denver, middle east, de... Hachette Books 2014-01-14 Damn Few: Making the Modern SEAL Warrior 1 6 5 ... 4.598485 0.824436 595.181818 2013-02-15 132 523.0 0.252390 2013-05 1401312802 62.0
195599 Lisa T. Bergren David C Cook [[Books, Teens, Literature & Fiction, Religiou... [cascade: a novel (river of time series), ital... David C. Cook 2011-06-01 Cascade: A Novel (River of Time Series) 0 1 5 ... 4.704545 0.573911 1153.469697 2011-05-25 132 1155.0 0.114286 2011-08 1434764311 27.0
184197 Steve Luxenberg [[Books, Politics & Social Sciences, Social Sc... [steve luxenberg, personal memoirs, mothers an... Hachette Books 2010-05-11 Annie's Ghosts: A Journey into a Family Secret 0 6 13 ... 4.303030 0.825167 1495.325758 2009-04-25 132 1915.0 0.068930 2009-07 1401310192 60.0
63195 Barry Eisler G. P. Putnam's Sons [[Books, Mystery, Thriller & Suspense, Thrille... [american mystery & suspense fiction, rain, su... G. P. Putnam's Sons 2004-07-22 Rain Storm (John Rain Thrillers) 4 0 16 ... 4.378788 0.917261 607.136364 2004-07-24 132 3651.0 0.036154 2014-05 0399151923 31.0
209672 Jack E. Levin Brand: Threshold Editions [[Books, Biographies & Memoirs, Historical, Un... [, trenton, war, 1783, campaigns, 1476731934, ... Threshold Editions 2013-06-04 George Washington: The Crossing 9 7 15 ... 4.280303 1.251147 343.128788 2013-06-04 132 414.0 0.318841 2013-08 1476731934 86.0
6116 L. J. Smith [[Books, Teens, Romance, Science Fiction & Dys... [science fiction, harper, children's 12, horro... Harper 1999-02-03 Vampire Diaries #1: The Awakening (The Vampire... 12 17 20 ... 3.696970 1.331093 712.325758 1998-12-13 132 5701.0 0.023154 2013-02 0061020001 18.0
184032 Jennifer Donnelly Brand: Hyperion [[Books, Romance, Historical, 20th Century], [... [1914, fiction, 1918, historical, world war, u... Hyperion 2011-08-02 The Wild Rose 1 5 22 ... 4.265152 0.919979 817.492424 2011-05-29 132 1151.0 0.114683 2011-08 1401301045 34.0
92236 A. L. Zaun [[Books, Literature & Fiction, Literary], [Boo... [a l zaun publishing, adult & contemporary rom... A L Zaun Publishing 2013-05-18 The Do Over 6 6 13 ... 4.250000 1.096655 1148.659091 2013-05-19 132 430.0 0.306977 2013-07 0578124114 77.0
72354 Jeff Abbott Grand Central Publishing [[Books, Mystery, Thriller & Suspense, Thrille... [0446575178, espionage, fiction, kidnapping, c... Grand Central Publishing 2011-07-01 Adrenaline (The Sam Capra series) 2 6 19 ... 4.128788 0.924460 634.303030 2010-08-14 132 1439.0 0.091730 2011-09 0446575178 30.0
39504 Elin Hilderbrand [[Books, Literature & Fiction, Women's Fiction... [nantucket island (mass), 0316043893, family l... Little, Brown and Company 2009-07-07 The Castaways: A Novel 5 16 21 ... 3.825758 1.138155 827.659091 2009-07-07 132 1842.0 0.071661 2009-09 0316043893 34.0
274045 Catherine Cooper [[Books, Children's Books, Fairy Tales, Folk T... [fantasy & magic, greek & roman, myths, 190682... Infinite Ideas 2010-08-20 The Golden Acorn 6 5 15 ... 4.227273 1.084260 360.454545 2011-03-15 132 1226.0 0.107667 2013-02 1906821658 41.0
98960 Karen Tack Houghton Mifflin [[Books, Cookbooks, Food & Wine, Baking, Cakes... [alan richardson, baking, cupcakes, bread & ca... Rux Martin/Houghton Mifflin Harcourt 2008-04-24 Hello, Cupcake!: Irresistibly Playful Creation... 1 4 6 ... 4.643939 0.770012 499.363636 2008-04-05 132 2300.0 0.057391 2008-06 0618829253 14.0
47387 Maria V. Snyder Harlequin Teen [[Books, Teens, Literature & Fiction, Social &... [children: young adult (gr 10, maria v snyder,... Harlequin Teen 2010-03-23 Inside Out 0 2 15 ... 4.265152 0.716260 1578.863636 2010-03-20 132 1586.0 0.083228 2010-05 037321006X 34.0
219303 Todd M Thiede [[Books, Mystery, Thriller & Suspense, Mystery]] [todd m thiede, lies to die for (max larkin de... CreateSpace Independent Publishing Platform 2013-12-04 Lies to Die for (Max Larkin Detective Series) 5 0 4 ... 4.712121 0.848620 633.454545 2013-12-09 132 226.0 0.584071 2014-02 1492166472 90.0

8590 rows × 22 columns

In [24]:
sum(books_with_categories['Number_of_reviews'].values)
Out[24]:
2513320
In [158]:
# Flatten the list of category lists, extracting all categories to which each book is assigned
all_categories = books_with_categories['categories'].apply(lambda x: 
        list(set([item for lists in x for item in lists])))
all_categories_frequency = dict(Counter(all_categories.sum()))
len(all_categories_frequency)
In [173]:
# Most popular categories
sorted_all_categories_frequency = sorted(all_categories_frequency.items(), key=operator.itemgetter(1), reverse=True)
sorted_all_categories_frequency[:100]

Data splitting into good vs. bad and controversial vs. non-controversial

K-means clustering of the data points

In [106]:
X = ratings_grouped_unstack_df[['Number_of_reviews', 'Weighted_mean', 'Std_dev', 'Avg_len']].values
range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, 15000 + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.  
    clusterer = MiniBatchKMeans(init='random', n_clusters=n_clusters, batch_size=40000,
                      n_init=10, max_no_improvement=10, verbose=0, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels, sample_size=10000)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    X_sample, cluster_labels_sample = resample(X, cluster_labels, replace = False, n_samples = 15000)
    sample_silhouette_values = silhouette_samples(X_sample, cluster_labels_sample)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels_sample == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        print(size_cluster_i)
        y_upper = y_lower + size_cluster_i

        color = cm.spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for the next plot
        y_lower = y_upper + 10  # leave a 10-sample gap between cluster plots

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed (on the same sample,
    # so that the color array matches the plotted points)
    colors = cm.spectral(cluster_labels_sample.astype(float) / n_clusters)
    ax2.scatter(X_sample[:, 0], X_sample[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

    plt.show()
For n_clusters = 2 The average silhouette_score is : 0.605456255789
4401
10599
For n_clusters = 3 The average silhouette_score is : 0.565506575443
7945
1564
5491
For n_clusters = 4 The average silhouette_score is : 0.549236961701
4995
2748
6565
692
For n_clusters = 5 The average silhouette_score is : 0.52430263359
4181
486
3407
2143
4783
For n_clusters = 6 The average silhouette_score is : 0.51741134872
3297
2066
4316
1005
4031
285
For n_clusters = 7 The average silhouette_score is : 0.497772266529
2599
1821
3254
925
3102
277
3022
For n_clusters = 8 The average silhouette_score is : 0.504782891189
54
1674
3276
949
3099
288
3007
2653
For n_clusters = 9 The average silhouette_score is : 0.47794246451
79
1746
2350
904
2503
242
2749
2628
1799
For n_clusters = 10 The average silhouette_score is : 0.43890444645
2025
255
2113
1572
1345
1479
1535
2076
1790
810

Sentiment analysis

Computing sentiment scores using Apache Spark and the VADER model (nltk)

In [ ]:
# Initialization of VADER model
sid = SentimentIntensityAnalyzer()
In [ ]:
# Create Spark context if it is not running
sc = SparkContext()
sqlContext = SQLContext(sc)
In [ ]:
# Loading file with reviews
text_file = sc.textFile("/media/adam/B236CB1D36CAE209/Studia/ADA/reviews_Books_5.json")
df = sqlContext.read.json(text_file)
In [ ]:
# Data transformation
df = df.withColumn('unixReviewTime', from_unixtime(df['unixReviewTime']))
df = df.withColumn('reviewTime', to_date(df['unixReviewTime']))
df = df.withColumn('unixReviewTime', df['unixReviewTime'].cast('timestamp'))
In [ ]:
# Function to apply on each row of the data - compute the sentiment score for each review
def sentiment_f(row):
    rew = sid.polarity_scores(row.reviewText)
    summ = sid.polarity_scores(row.summary)
    return (row.asin, row.unixReviewTime, row.reviewerID, row.overall,
            rew['compound'], rew['neg'], rew['neu'], rew['pos'],
            summ['compound'], summ['neg'], summ['neu'], summ['pos'])
In [ ]:
# Transform and collect the data about sentiment
sentiment = df.select("asin", "reviewText", "summary", "unixReviewTime",
                      'overall', "reviewerID").rdd.map(sentiment_f).collect()
In [19]:
sentiment_df = pd.DataFrame(sentiment, columns=['asin', 'unixReviewTime', 'reviewerID', 'overall', 'rew_compound', 'rew_neg', 'rew_neu', 'rew_pos',
           'summ_compound', 'summ_neg', 'summ_neu', 'summ_pos'])
sentiment_df.head()
Out[19]:
asin unixReviewTime reviewerID overall rew_compound rew_neg rew_neu rew_pos summ_compound summ_neg summ_neu summ_pos
0 B009UMF4KA 2013-02-20 01:00:00 A24RY39KA1P8K0 1.0 -0.9293 0.275 0.668 0.056 0.0000 0.000 1.000 0.000
1 B009UMF4KA 2013-02-18 01:00:00 AA1HAXGZ8A6OZ 3.0 0.8159 0.042 0.841 0.117 0.0000 0.000 1.000 0.000
2 B009UMF4KA 2013-06-30 02:00:00 A31EPS36AHUBF9 5.0 0.9020 0.000 0.591 0.409 0.4588 0.000 0.571 0.429
3 B009UMF4KA 2013-01-30 01:00:00 A33LSPIIUUBVBY 4.0 0.4927 0.145 0.670 0.185 0.4404 0.000 0.256 0.744
4 B009UMF4KA 2013-03-17 01:00:00 A20EFEJWLMIXRT 3.0 -0.6887 0.100 0.839 0.061 -0.4019 0.383 0.426 0.191
In [ ]:
# Save collected data
sentiment_df.to_csv("sentiment.csv", index=False)
In [ ]:
# Stop Spark context
sc.stop()

Data analysis

From our previous look at the data, we concluded that we should focus on negative and positive ratings. For now we will therefore work with the file giving the number of 1-, 2-, 3-, 4- and 5-star ratings for each book.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
# Create a dataframe with the ratings for each book, plus the average length and total number of reviews
reviews_df_raw = pd.read_csv('ratings_grouped_unstack.csv', index_col=0)
d1 = pd.read_csv('avg_length_and_number_of_reviews_per_book.csv', index_col=0)
In [4]:
reviews_df_raw.head()
Out[4]:
1 2 3 4 5
asin
000100039X 6 4 8 15 173
0001055178 0 4 2 10 2
0001473123 1 0 0 2 13
0001473727 0 0 0 0 7
0001473905 0 0 1 0 5

In this dataset, asin is the Amazon identifier of the book, and the column named 1 (respectively 2, 3, 4 and 5) gives how many reviews carried a rating of 1 star (respectively 2, 3, 4 and 5 stars); every review includes a rating from 1 to 5 stars.

We can start by plotting the total number of ratings for each number of stars.

In [5]:
x = reviews_df_raw.columns
y = reviews_df_raw.sum()

plt.plot(x,y)
plt.title('Total number of 1 to 5 stars ratings reviews')
plt.ylabel('Number of reviews')
plt.xlabel('Number of stars associated to the review')
plt.xticks([1,2,3,4,5])
Out[5]:
([<matplotlib.axis.XTick at 0x264122a1668>,
  <matplotlib.axis.XTick at 0x264122ac630>,
  <matplotlib.axis.XTick at 0x26412276470>,
  <matplotlib.axis.XTick at 0x264122fd940>,
  <matplotlib.axis.XTick at 0x26412301438>],
 <a list of 5 Text xticklabel objects>)

Let's try to plot the logarithm of y for this plot to see if we can find something interesting.

In [6]:
y2=np.log(y)
g=plt.plot(x,y2)

plt.title('Logarithm of the total number of 1 to 5 stars ratings reviews')
plt.ylabel('log(number of reviews)')
plt.xlabel('Number of stars associated to the review')
plt.xticks([1,2,3,4,5])
Out[6]:
([<matplotlib.axis.XTick at 0x2641233c860>,
  <matplotlib.axis.XTick at 0x26412355fd0>,
  <matplotlib.axis.XTick at 0x26412234240>,
  <matplotlib.axis.XTick at 0x2641239f198>,
  <matplotlib.axis.XTick at 0x2641239fc50>],
 <a list of 5 Text xticklabel objects>)

From 2 to 5 stars, the logarithm of y gives a straight line! This means that people give exponentially more 5-star than 2-star ratings. This is a very interesting observation that will need to be questioned further, for example by checking whether anything in Amazon's rules concerning reviews could explain it.
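As a quick sanity check of that log-linearity, one could fit a straight line to log(counts) for 2 to 5 stars and inspect the relative residuals. A sketch with made-up per-star totals (the real numbers would come from `reviews_df_raw.sum()`):

```python
import numpy as np

# Hypothetical per-star totals standing in for reviews_df_raw.sum();
# the real values come from the dataset.
stars = np.array([2, 3, 4, 5])
counts = np.array([80_000, 190_000, 480_000, 1_200_000])

# If counts grow exponentially with the star value, log(counts) is linear.
slope, intercept = np.polyfit(stars, np.log(counts), 1)

# Relative residuals close to zero indicate a good exponential fit.
predicted = np.exp(intercept + slope * stars)
rel_error = np.abs(predicted - counts) / counts
print(slope, rel_error.max())
```

A small maximum relative residual would support the "exponentially more high ratings" reading; a large one would point to mere curvature instead.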

We will now try to find "good books", "controversial books" and "bad books", defined as follows:

  • "good book" = many high ratings, few low ratings
  • "bad book" = many low ratings, few high ratings
  • "controversial book" = any kind of ratings, with small variance between the numbers of 1-, 2-, 3-, 4- and 5-star reviews
In [13]:
# merge the two datasets and calculate the "bad ratio" and "good_ratio" for each book
#bad_ratio and good_ratio are defined as the number of 1,2 stars (resp 4,5 stars) divided 
#by the total number of reviews 

d1['asin']=d1.index
reviews_df_raw['asin']=reviews_df_raw.index
reviews_df_tot=pd.merge(reviews_df_raw, d1, on=['asin'])
reviews_df_tot['bad_ratio']= (reviews_df_tot['1']+reviews_df_tot['2'])/(reviews_df_tot['1']+reviews_df_tot['2']+reviews_df_tot['3']+reviews_df_tot['4']+reviews_df_tot['5'])
reviews_df_tot['good_ratio']= (reviews_df_tot['4']+reviews_df_tot['5'])/(reviews_df_tot['1']+reviews_df_tot['2']+reviews_df_tot['3']+reviews_df_tot['4']+reviews_df_tot['5'])


reviews_df_tot.head()
Out[13]:
1 2 3 4 5 asin Avg_len number_of_reviews bad_ratio good_ratio
0 6 4 8 15 173 000100039X 638.033981 206 0.048544 0.912621
1 0 4 2 10 2 0001055178 1308.611111 18 0.222222 0.666667
2 1 0 0 2 13 0001473123 287.937500 16 0.062500 0.937500
3 0 0 0 0 7 0001473727 187.571429 7 0.000000 1.000000
4 0 0 1 0 5 0001473905 481.833333 6 0.000000 0.833333

For our analysis to make sense, we need books with a significant number of reviews: if a book has only one or two reviews, a single negative/positive review immediately classifies it as a bad/good book, which is not reliable. However, if we set the minimum number of reviews too high, we will end up with very few "bad books", as we will see that the bulk of bad books have very few reviews.

We will try with a limit of 10 reviews.

In [14]:
reviews_df = reviews_df_tot.copy()[reviews_df_tot.number_of_reviews>10]
In [15]:
# As a first guess, we can define bad books as having a bad_ratio higher than 0.7
# (respectively a good_ratio higher than 0.7 for good books)


# We compare how many bad books we get when considering the whole dataset versus
# only the books with more than 10 reviews (for reliability)
bad_books_tot = reviews_df_tot[reviews_df_tot.bad_ratio>0.7]
bad_books = reviews_df[reviews_df.bad_ratio>0.7]

#As we have already a significant number of good and controversial books with our limit, we just consider with the limit 
good_books = reviews_df[reviews_df.good_ratio>0.7]
controversial_books= reviews_df[abs(reviews_df.good_ratio-reviews_df.bad_ratio)<0.1]
In [16]:
bad_books_tot.describe()
Out[16]:
1 2 3 4 5 Avg_len number_of_reviews bad_ratio good_ratio
count 936.000000 936.000000 936.000000 936.000000 936.000000 936.000000 936.000000 936.000000 936.000000
mean 5.973291 1.967949 0.591880 0.475427 0.849359 852.246335 9.857906 0.816833 0.123201
std 16.132484 2.010411 1.105213 0.930433 4.037758 577.108054 21.163739 0.085686 0.095057
min 0.000000 0.000000 0.000000 0.000000 0.000000 116.666667 5.000000 0.704545 0.000000
25% 3.000000 1.000000 0.000000 0.000000 0.000000 449.900000 5.000000 0.750000 0.000000
50% 4.000000 2.000000 0.000000 0.000000 0.000000 710.778333 7.000000 0.800000 0.142857
75% 6.000000 3.000000 1.000000 1.000000 1.000000 1128.700000 9.000000 0.843117 0.200000
max 447.000000 20.000000 14.000000 12.000000 119.000000 6498.200000 582.000000 1.000000 0.294118
In [17]:
bad_books.describe()
Out[17]:
1 2 3 4 5 Avg_len number_of_reviews bad_ratio good_ratio
count 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000
mean 15.794595 3.372973 1.502703 1.140541 2.405405 842.318764 24.216216 0.796491 0.134620
std 34.482998 3.216183 1.981228 1.671933 8.847929 440.475755 44.836257 0.077958 0.076236
min 2.000000 0.000000 0.000000 0.000000 0.000000 147.909091 11.000000 0.704545 0.000000
25% 7.000000 1.000000 0.000000 0.000000 0.000000 539.714286 12.000000 0.733333 0.076923
50% 10.000000 3.000000 1.000000 1.000000 1.000000 768.315789 15.000000 0.777070 0.138889
75% 16.000000 5.000000 2.000000 2.000000 2.000000 1090.764706 21.000000 0.833333 0.187500
max 447.000000 20.000000 14.000000 12.000000 119.000000 2494.909091 582.000000 1.000000 0.294118

As we said before, the number of "bad books" depends heavily on the minimum number of reviews we require. This might be because when a book gets some bad reviews, its popularity drops and people quickly stop reviewing it.

In [18]:
good_books.describe()
Out[18]:
1 2 3 4 5 Avg_len number_of_reviews bad_ratio good_ratio
count 142711.000000 142711.000000 142711.000000 142711.000000 142711.000000 142711.000000 142711.000000 142711.000000 142711.000000
mean 0.954432 1.410024 3.788601 10.963324 26.448417 847.175892 43.564799 0.049209 0.870685
std 3.312660 3.970387 9.158551 22.480132 59.200941 491.345135 92.935312 0.053780 0.084861
min 0.000000 0.000000 0.000000 0.000000 0.000000 121.909091 11.000000 0.000000 0.700315
25% 0.000000 0.000000 1.000000 3.000000 9.000000 464.714286 14.000000 0.000000 0.806452
50% 0.000000 0.000000 2.000000 6.000000 13.000000 749.948718 22.000000 0.038462 0.875000
75% 1.000000 1.000000 4.000000 11.000000 25.000000 1117.581144 40.000000 0.080000 0.934783
max 229.000000 298.000000 607.000000 1271.000000 4839.000000 5850.666667 6717.000000 0.296296 1.000000
In [19]:
controversial_books.describe()
Out[19]:
1 2 3 4 5 Avg_len number_of_reviews bad_ratio good_ratio
count 2810.000000 2810.000000 2810.000000 2810.000000 2810.000000 2810.000000 2810.000000 2810.000000 2810.000000
mean 6.190747 5.680783 6.414591 5.698221 6.730961 972.133560 30.715302 0.385675 0.402699
std 12.073168 10.577229 11.582232 12.215638 13.130457 554.989698 57.239637 0.066117 0.063787
min 0.000000 0.000000 0.000000 0.000000 0.000000 136.357143 11.000000 0.090909 0.153846
25% 2.000000 2.000000 2.000000 2.000000 3.000000 553.401786 13.000000 0.347826 0.363636
50% 4.000000 4.000000 4.000000 3.000000 4.000000 908.617908 17.000000 0.384615 0.406250
75% 6.000000 6.000000 7.000000 6.000000 7.000000 1270.751786 30.000000 0.428571 0.454545
max 359.000000 373.000000 397.000000 464.000000 457.000000 4822.833333 2050.000000 0.545455 0.548387

We can already see that there are far fewer "bad books" under our definition, and that their number depends strongly on the minimum number of reviews we impose on the dataset. This needs to be analysed further.
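One way to analyse this further is to sweep the minimum-review cutoff and count the surviving "bad books" at each level. A sketch on synthetic data standing in for `reviews_df_tot` (the column names mirror ours, the values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for reviews_df_tot; column names mirror ours.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'number_of_reviews': rng.integers(1, 100, size=1000),
    'bad_ratio': np.clip(rng.normal(0.2, 0.25, size=1000), 0, 1),
})

# Count books classified as "bad" (bad_ratio > 0.7) at each cutoff.
counts = {k: int(((demo.number_of_reviews > k) & (demo.bad_ratio > 0.7)).sum())
          for k in [0, 5, 10, 20, 50]}
print(counts)
```

On the real data, a steep drop between consecutive cutoffs would confirm that badly rated books concentrate among the rarely reviewed ones.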

As there are many outliers in our dataset, we will plot only robust summaries. The first function defined below plots boxplots with outliers, the second one without.

In [20]:
def plot_categories(column):
    # Create a figure with 3 vertically stacked subplots and set the title
    fig, ax = plt.subplots(figsize=(15,12), ncols=1, nrows=3)
    plt.suptitle("Boxplots of " + column + " for three categories of books", fontsize=18)

    # Add vertical space between the subplots
    plt.subplots_adjust(hspace=0.5)


    # Set a title to each graph
    ax[0].set_title("Bad books", fontsize=15)
    ax[1].set_title("Good books", fontsize=15)
    ax[2].set_title("Controversial Books", fontsize=15)

    # Plot the data

    sns.boxplot( bad_books[column], ax=ax[0])
    sns.boxplot( good_books[column], ax= ax[1])
    sns.boxplot( controversial_books[column], ax=ax[2])

    plt.show()
In [21]:
def plot_categories_no_outliers(column):
    # Create a figure with 3 vertically stacked subplots and set the title
    fig, ax = plt.subplots(figsize=(15,12), ncols=1, nrows=3)
    plt.suptitle("Boxplots of " + column + " for three categories of books", fontsize=18)

    # Add vertical space between the subplots
    plt.subplots_adjust(hspace=0.5)


    # Set a title to each graph
    ax[0].set_title("Bad books", fontsize=15)
    ax[1].set_title("Good books", fontsize=15)
    ax[2].set_title("Controversial Books", fontsize=15)

    # Plot the data

    sns.boxplot( bad_books[column], ax=ax[0], showfliers=False)
    sns.boxplot( good_books[column], ax= ax[1], showfliers=False)
    sns.boxplot( controversial_books[column], ax=ax[2], showfliers=False)

    plt.show()
In [22]:
plot_categories('Avg_len')
In [23]:
plot_categories_no_outliers('Avg_len')

The average length of the reviews is fairly balanced across the groups. However, there are many more outliers (longer reviews) among the "good books" and "controversial books" than among the "bad books". The median for controversial books also seems slightly higher than for the other two categories. This could be because when people disagree, they are more willing to write longer comments explaining their point of view. However, we see that we need more reliable statistics than the bad_ratio and good_ratio we made up above before drawing conclusions.
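For a more reliable comparison of review lengths between groups, a rank-based test such as scipy's Mann-Whitney U is insensitive to the long-review outliers. A sketch on illustrative log-normal samples (a real run would pass the `Avg_len` columns of two of our groups instead):

```python
import numpy as np
from scipy import stats

# Illustrative review-length samples for two groups; log-normal mimics
# the heavy right tail we see in Avg_len.
rng = np.random.default_rng(42)
len_controversial = rng.lognormal(mean=6.9, sigma=0.6, size=500)
len_bad = rng.lognormal(mean=6.6, sigma=0.6, size=500)

# Rank-based test: robust to the extreme lengths in the tails.
u_stat, p_value = stats.mannwhitneyu(len_controversial, len_bad,
                                     alternative='greater')
print(p_value)
```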

What we can do now is refine our definitions of the book categories; to do that, we can consider the mean and standard deviation of the ratings, which may give more interpretable results.

We can calculate the mean and standard deviation of the ratings for each book. These non-robust statistics are usable here because ratings are bounded between 1 and 5, so there are no outliers.

In [34]:
# start again with a copy of our first dataset 
reviews_df_2 = reviews_df_tot.copy()
del reviews_df_2['bad_ratio']
del reviews_df_2['good_ratio']

reviews_df_2.head()
Out[34]:
1 2 3 4 5 asin Avg_len number_of_reviews
0 6 4 8 15 173 000100039X 638.033981 206
1 0 4 2 10 2 0001055178 1308.611111 18
2 1 0 0 2 13 0001473123 287.937500 16
3 0 0 0 0 7 0001473727 187.571429 7
4 0 0 1 0 5 0001473905 481.833333 6
In [35]:
# calculate the mean for each book 

reviews_df_2 ['mean_rat'] = (reviews_df_2['1']*1+reviews_df_2['2']*2+reviews_df_2['3']*3+reviews_df_2['4']*4+reviews_df_2['5']*5)/reviews_df_2.number_of_reviews
reviews_df_2.head()
Out[35]:
1 2 3 4 5 asin Avg_len number_of_reviews mean_rat
0 6 4 8 15 173 000100039X 638.033981 206 4.674757
1 0 4 2 10 2 0001055178 1308.611111 18 3.555556
2 1 0 0 2 13 0001473123 287.937500 16 4.625000
3 0 0 0 0 7 0001473727 187.571429 7 5.000000
4 0 0 1 0 5 0001473905 481.833333 6 4.666667
In [36]:
#Plot a histogram 
fig, ax = plt.subplots(figsize=(15,10), ncols=1, nrows=1)

reviews_df_2.mean_rat.hist(bins=np.linspace(1,5,50), rwidth=0.9)




plt.title('Distribution for the mean ratings of a book')
plt.xlabel('Mean rating')
plt.ylabel('Number of books')
plt.xticks([1,2,3,4,5])
Out[36]:
([<matplotlib.axis.XTick at 0x223cea3a978>,
  <matplotlib.axis.XTick at 0x223d14fbdd8>,
  <matplotlib.axis.XTick at 0x223d14e7668>,
  <matplotlib.axis.XTick at 0x223d4764d30>,
  <matplotlib.axis.XTick at 0x223d476c828>],
 <a list of 5 Text xticklabel objects>)

We see that the distribution is strongly left-skewed, and we will try to transform it following this article: http://rogeriofvieira.com/wp-content/uploads/2016/05/Data-Transformations-1.pdf.

From this article, we see that we can try two transformations:

  • Logarithm: growth rates are often exponential, and log transforms will often normalize them. Log transforms are particularly appropriate if the variance increases with the mean.
  • Reciprocal: if a log transform does not normalize your data, you can try a reciprocal (1/x) transformation. This is often used for enzyme reaction-rate data.
In [48]:
# Apply the log transformation to the weighted mean, as explained in the article
reviews_df_2['Trans_weighted_mean'] = np.log10(6 - reviews_df_2['mean_rat'].values)
In [38]:
reviews_df_2.plot(y = 'Trans_weighted_mean', kind = 'hist', bins = 50, logx=False)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x223d4781cf8>
In [41]:
reviews_df_2['Inverse_weighted_mean'] = 1/(6 - reviews_df_2['mean_rat'].values)
In [42]:
reviews_df_2.plot(y = 'Inverse_weighted_mean', kind = 'hist', bins = 50, logx=False)
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x223ce957a20>

We see that these plots are not very smooth, because books with only a few ratings produce values exactly equal to 5, for example. As before, we will consider only books with more than 10 reviews.

In [43]:
reviews_df_sm = reviews_df_2.copy()[reviews_df_tot.number_of_reviews>10]
In [45]:
reviews_df_sm.plot(y = 'mean_rat', kind = 'hist', bins = 100, logx=False)
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x223ce9e50f0>
In [46]:
reviews_df_sm.plot(y = 'Trans_weighted_mean', kind = 'hist', bins = 50, logx=False)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x223d19f7b00>
In [47]:
reviews_df_sm.plot(y = 'Inverse_weighted_mean', kind = 'hist', bins = 50, logx=False)
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x223dd515358>

Now the graphs look much better. We can then refine our definitions of "good" and "bad" books, and add "neutral" books, as follows:

  • "bad" book = the books in the leftmost 25% of this plot
  • "neutral" book = the books in the middle 50% of this plot
  • "good" book = the books in the rightmost 25% of this plot
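This quartile split can be done directly with pandas quantiles; a minimal sketch on made-up mean ratings (`demo` and its values are illustrative, the real input would be `reviews_df_sm['mean_rat']`):

```python
import numpy as np
import pandas as pd

# Made-up mean ratings standing in for reviews_df_sm['mean_rat'].
rng = np.random.default_rng(1)
demo = pd.DataFrame({'mean_rat': np.clip(rng.normal(4.2, 0.6, size=200), 1, 5)})

# Leftmost 25% -> "bad", middle 50% -> "neutral", rightmost 25% -> "good".
q25, q75 = demo['mean_rat'].quantile([0.25, 0.75])
demo['category'] = np.where(demo['mean_rat'] <= q25, 'bad',
                   np.where(demo['mean_rat'] >= q75, 'good', 'neutral'))
print(demo['category'].value_counts())
```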

We will also analyse the standard deviation to define what is a "controversial" book.
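Since we only store per-star counts, the standard deviation per book has to be computed as a weighted (population) standard deviation over the star values. A sketch using our column layout, with the first row taken from book 000100039X in the table above and a uniform second row for contrast:

```python
import numpy as np
import pandas as pd

# Per-book star counts in our column layout; the first row is book
# 000100039X from the table above, the second is a uniform example.
demo = pd.DataFrame({'1': [6, 10], '2': [4, 10], '3': [8, 10],
                     '4': [15, 10], '5': [173, 10]})
stars = np.arange(1, 6)
counts = demo[['1', '2', '3', '4', '5']].values

# Weighted mean and population standard deviation of the star values.
n = counts.sum(axis=1)
mean_rat = (counts * stars).sum(axis=1) / n
var = (counts * (stars - mean_rat[:, None]) ** 2).sum(axis=1) / n
demo['std_rat'] = np.sqrt(var)
print(demo[['std_rat']].round(3))
```

A uniform spread of ratings (the second row) gives a large standard deviation, which is exactly the "controversial" profile we want to detect.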

Once those definitions are set, we can check whether they match a sentiment-analysis model like VADER: VADER predicts whether a sentence is negative or positive, so we could score each review's sentiment and compare it to the star rating it comes with.
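As a first check of that match, one can rank-correlate the star ratings with the compound scores; the five sentiment_df rows displayed above already hint at the trend:

```python
import pandas as pd
from scipy import stats

# The first five rows of sentiment_df shown above: star rating vs.
# VADER compound score of the review text.
demo = pd.DataFrame({
    'overall': [1.0, 3.0, 5.0, 4.0, 3.0],
    'rew_compound': [-0.9293, 0.8159, 0.9020, 0.4927, -0.6887],
})

# Rank correlation: do higher star ratings come with more positive text?
rho, p_value = stats.spearmanr(demo['overall'], demo['rew_compound'])
print(round(rho, 3))
```

Five rows are of course far too few to conclude anything; the real check would run over the full sentiment.csv.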

In [ ]:
# Spark context shutdown
sc.stop()

Roadmap for the future

Our internal milestones up until the report are:

  • Decide all models of grouping data into categories (negative/positive, fast/slow plateauing of review numbers, etc.)

  • Decide parameters and thresholds for category definitions based on the data

  • Enrich data by scraping Amazon for genres as extensively as possible

  • Create necessary intermediate files containing formatted data, to speed up analysis and computing

  • Find the best way to use VADER to analyze positivity and negativity of reviews

  • Calculate correlations between popularity metrics and positive and negative reviews and ratings

    • With respect to time
    • By aggregation by category (good, bad, controversial)
    • By aggregation by genre and author
  • If necessary, run analysis on the cluster

  • Find the exact data story we want to tell (choose exact questions), make it compelling, and link it to social good

  • Decide format of report

  • Depending on report format, distribute tasks to individual group members

  • Produce good, polished visualizations that tell the story

  • Construct report into a coherent document

  • Update readme according to progress

  • Create a plan for presentation and poster creation

The enrichment of the data will be done using the Amazon Scraper.ipynb notebook, which contains a description of its function (please have a look).

In order to use VADER, we will have to decide how to apply its sentence-based logic to entire reviews. The obvious way is to analyze each sentence and take the mean score, but we will look into whether this is the most sensible approach or whether another would be more relevant.
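The averaging itself is easy to prototype with the scorer injected as a function (in the real pipeline this would be `sid.polarity_scores`; the toy scorer and its word lists below are purely illustrative):

```python
import re

def review_score(text, sentence_scorer):
    """Average per-sentence 'compound' scores over a whole review.

    sentence_scorer maps a sentence to a dict with a 'compound' key,
    e.g. SentimentIntensityAnalyzer().polarity_scores in the real run.
    """
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    if not sentences:
        return 0.0
    return sum(sentence_scorer(s)['compound'] for s in sentences) / len(sentences)

# Toy scorer for illustration only: positive words minus negative words.
def toy_scorer(sentence):
    pos = len(re.findall(r'\b(great|loved|wonderful)\b', sentence.lower()))
    neg = len(re.findall(r'\b(boring|terrible|waste)\b', sentence.lower()))
    return {'compound': float(pos - neg)}

score = review_score("Great plot. The middle was boring. Loved the ending!", toy_scorer)
print(score)  # mean of +1, -1, +1
```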

In the future, these and many more notebook cells will be filled:

In [16]:
# Retrieve sentiments of reviews:
# TODO...
In [17]:
# Calculate correlations:
# TODO...
In [18]:
# Produce final visualizations:
# TODO...
In [ ]: